# Making Sense of Data

Last week we did some basic EDA (Exploratory Data Analysis) with the MetOffice weather data, but it was limited to mainly looking at a few graphs right at the end of the practical. Thanks to that, depending on the data you got you may have spotted a 'problem' or two (_e.g._ temperature readings at the start or end of the time series that were basically 0).

This week we want to tackle this in a more systematic way. We are going to switch data sets because the Socio-Economic data is easier to manipulate than the weather data and it has a few features that are particularly useful for demonstrating the value of transformation and standardisation.

## Introduction: all models are wrong, but some are useful

The statistician George Box [once said](https://en.wikipedia.org/wiki/All_models_are_wrong) "all models are wrong but some are useful". Now you might think that 'wrong' _necessarily implies_ uselessness, but this aphorism should tell you that things are a lot more interesting than that: let's review the idea of statistics as the study of a '[data-generating process](http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9639.2012.00524.x/full)'. 

The data that we work with is a _representation_ of reality; it is not reality itself. Just because I can say that the height of human beings is normally distributed with a mean of, say, 170cm and standard deviation of 8cm doesn't mean that I've _explained_ that process. What I have done is to say that reality can be reasonably well approximated by a data-generating process that uses the normal distribution. 

Given that understanding of height distributions, I know that someone with a height of 2m is very, very improbable. Not impossible, just highly unlikely. So if I meet someone that tall then that's quite exciting! But my use of a normal distribution to represent the empirical reality of human height doesn't mean that I think it is _actually_ distributed randomly amongst all human beings in a gigantic lottery system: some parts of the world are typically shorter, other parts typically taller; some families are typically shorter, while others are typically taller...

So the _real_ reason for someone's height is to do with, I assume, a mix of genetics and nutrition. However, across large groups of people it's possible to _represent_ the cumulative impact of those individual realities with a simple normal distribution. And using that simplified data-generating process allows me to do things like estimate the likelihood of meeting someone more than 2m tall (which is why I'd be excited to do so, though not as excited as the guy in the next video...).
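To put a rough number on 'highly unlikely', here is a minimal sketch that uses the illustrative figures from above (mean 170cm, standard deviation 8cm) and Python's built-in `statistics.NormalDist`. The variable names are my own, and the figures are the made-up ones from the text, not real measurements.

```python
from statistics import NormalDist

# The illustrative figures from the text: mean 170cm, standard deviation 8cm
height = NormalDist(mu=170, sigma=8)

# How many standard deviations above the mean is 2m (200cm)?
z = (200 - height.mean) / height.stdev

# Probability of drawing someone taller than 2m from this distribution
p_taller = 1 - height.cdf(200)

print("z-score of 2m: {0:.2f}".format(z))
print("P(height > 2m): {0:.6f}".format(p_taller))
```

The probability is tiny but not zero, which is exactly the 'not impossible, just highly unlikely' point.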

Here's a (genuinely terrifying) video that tries to explain this whole idea in a different way:

[![From reality to make-believe](http://img.youtube.com/vi/HAfI0g_S9oo/0.jpg)](https://www.youtube.com/watch?v=HAfI0g_S9oo)

In the same way, real individuals earn different incomes for all sorts of reasons: skills, education, negotiation ability... and, of course, systematic discrimination or bias. Because of wide variations in individual lived experience, it's quite hard to _prove_ that any _one_ person has been discriminated against unless you have the 'smoking gun' of an email or other direct evidence that this has happened. 

But if I have data on _many_ men and women (from a company, industry, or society) to work with, then I can take a look at what data-generating processes best describe what I've observed. And I can also create a data-generating process that would describe what I'd _expect_ to see if no systematic discrimination were taking place at all. I make that sound simple but, of course, it's really hard to do this properly: do you account for the fact that discrimination has probably happened throughout someone's lifetime, not just in their most recent job? And so on.

But, when we have created a data-generating process that captures this at what we feel is an appropriate level, then the analytical process becomes about testing to see if there is a _significant_ difference between what we expected and what we observed. Once we've done that, then we can start to rule out claims that 'there are no good candidates' and the other defences of the indefensible. It is always theoretically _possible_ that a company had trouble finding qualified candidates, but as you put together the evidence from your model it may well become increasingly _improbable_.

So, always remember that, while the data is not reality, it _is_ a very useful abstraction of reality that allows us to make certain claims about what we expected to see. Linking our observations to what we know about the characteristics of the data generating process then allows us to look at the _significance_ of our results or to search for outliers in the data that seem highly improbable and, consequently, worth further investigation.

In [None]:
import pandas as pd
df = pd.read_csv('./Data/Data_NSSHRP_UNIT_URESPOP.csv')
df.head(3)

## Controlling Headers in Pandas

Of course, if we had to open this file in Excel in order to remove the extra rows then that wouldn't be very useful, would it? Now it's time to see what we can do in pandas to tidy this up _when_ we do the import. I'd suggest looking at the [pandas documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for `read_csv`, and in particular at the following options that can be _passed_ to `read_csv()`:
* header
* names
* usecols
* skiprows
* nrows

By way of guidance:
* I would _suggest_ that you skip one row
* I would _suggest_ that you drop one of the columns immediately after the import
* I would _suggest_ that you specify your own column names only after loading the data
* I would _suggest_ that while you get this right you only read a small number of rows
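To show what 'one parameter at a time' looks like in practice, here is a small sketch on a made-up file held in memory (so it is _not_ the NS-SeC file, and the column names and values are invented). Each step adds a single option to `read_csv` and checks the result before moving on.

```python
import io
import pandas as pd

# A made-up file: a header row, a row of codes we don't want, then data.
# The trailing commas create an empty extra column, as some exports do.
raw = io.StringIO(
    "Area,Total,Group1,\n"
    "AREA_CODE,F001,F002,\n"
    "Alpha,100,10,\n"
    "Beta,200,50,\n"
    "Gamma,300,30,\n"
)

# Step 1: read a couple of rows to see what we are dealing with
step1 = pd.read_csv(raw, nrows=2)
print(step1)

# Step 2: rewind the buffer, then skip the second line of the file (row 1)
raw.seek(0)
step2 = pd.read_csv(raw, skiprows=[1], nrows=2)
print(step2)

# Step 3: keep only the columns we want, selected by position
raw.seek(0)
step3 = pd.read_csv(raw, skiprows=[1], usecols=[0, 1, 2])
print(step3)
```

The `raw.seek(0)` calls are only needed because this toy example re-reads an in-memory buffer; with a file path you can simply call `read_csv` again.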

Try to develop the answer yourself by adding one parameter at a time to _build_ a working import statement (don't try all of these options at once!). This is critical: you _will not get it right first time_, so you need to take an incremental approach and assemble the code a little bit at a time. We keep coming back to this, but it _always_ bears repeating: rather than write all of your code and then cross your fingers, start with a small thing that _works_ and then build outwards from there. I've given you one parameter to get you started...

Now over to you! *[Note: if you get stuck, answer at the end of the notebook.]*

In [None]:
df = pd.read_csv(???, nrows=???)
df.head()

### Finally

To ensure that you end up with the same column names that I do, please add this to your code in the coding area below (and add some comments so that you know what's going on!):
```python
colnames = ['CDU','GeoCode','GeoLabel','GeoType','GeoType2','Total']
for i in range(1,9):
    colnames.append('Group' + str(i))
colnames.append('NC')
df.columns = colnames
```

You've now written the components of a Python script that can take a file downloaded from InFuse, automatically extract the contents from a Zip archive, and then load the data into pandas automatically. Now that we've done it for _one_ file, we can work out how to do it for _any_ file. That's what we mean by scalability: yes, the column names will be different for other files downloaded from InFuse, but the _process_ is the same: we could create a function that handles all of this for us and the only thing it would need is the names that we want to use for the columns! 
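As a sketch of what such a function might look like, here is a hypothetical helper (`load_table` is my own name, not something from pandas or InFuse) that reads a CSV, drops any empty 'Unnamed' columns, and applies the column names you give it. It is demonstrated on made-up data held in memory rather than on a real download.

```python
import io
import pandas as pd

def load_table(source, colnames, skiprows=None):
    """Hypothetical helper: read a CSV and apply our own column names.

    `source` is anything pd.read_csv accepts (a path or an open buffer).
    """
    df = pd.read_csv(source, skiprows=skiprows)
    # Drop any empty 'Unnamed: N' columns created by trailing commas
    unnamed = [c for c in df.columns if str(c).startswith('Unnamed')]
    df = df.drop(unnamed, axis=1)
    # Fail loudly if the names we were given don't match the data
    if len(colnames) != len(df.columns):
        raise ValueError("Expected {0} column names, got {1}".format(
            len(df.columns), len(colnames)))
    df.columns = colnames
    return df

# Made-up data standing in for a downloaded file
demo = io.StringIO("a,b,c,\nx,y,z,\nAlpha,100,10,\nBeta,200,50,\n")
table = load_table(demo, ['GeoLabel', 'Total', 'Group1'], skiprows=[1])
print(table)
```

Checking the number of names against the number of columns is a design choice: it is much easier to debug an error at load time than a mislabelled column later on.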

# Working with a Data Series

We've already done quite a bit with the Series (i.e. column) class offered by pandas, but I want to revisit it so that you understand why getting to grips with how the series works (especially the 'index', which is a special type of series) is crucial to getting the most out of pandas.

## Adding a New Series

You saw last week that you can add a new series to an existing data frame using the dictionary-like syntax:
```python
df['<new series name>'] = pd.Series(... <series definition> ...)
``` 
But just to remind you: see how familiar that syntax is? `df['NewSeriesName']` is _exactly_ like creating and assigning a new key/value pair to a dictionary! The only difference here is that the 'value' we store in the dictionary is a Series object, and not a simple variable (String, int, float).
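Here is the comparison side by side, as a minimal sketch on a toy data frame (the values are made up, and `df_demo` is just an illustrative name):

```python
import pandas as pd

# Toy data frame (made-up values)
df_demo = pd.DataFrame({'Total': [100, 200], 'Group1': [10, 50]})

# A plain dictionary: assign a new key/value pair
d = {}
d['answer'] = 42

# A data frame: same syntax, but the 'value' is a whole Series
df_demo['Group1Pct'] = pd.Series(df_demo.Group1 / df_demo.Total)

print(df_demo.columns.tolist())
print(df_demo['Group1Pct'])
```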

### String Formatting for Pretty-Printed Numbers 

Notice also the `<string>.format()` command I’ve used here: `{0:> 8.1f}`. In order to understand how this formats the results in a nice, systematic way you will need to read [the documentation](http://www.python.org/dev/peps/pep-3101/).

The 'pep' tells you that:
* `{0}` tells Python to grab the first value inside the parentheses (`format(... values ...)`) and to stick it into the string at this point, but `:...` tells Python that we also want to format that value in the particular way specified by the `...`.
* `>` tells Python that the value should be right-aligned.
* The space (' ') next to the `>` is the 'sign' option: it leaves a blank where a minus sign would go so that positive and negative numbers line up. Any 'fill' is done with whitespace by default (to pad with something else, such as a 0, you put that character _before_ the `>`).
* `8.1f` tells Python to treat anything it gets as a float (even if the variable is an int) and to format it with 1 digit after the decimal point and a minimum total width of 8 characters (which ties us back to the right-alignment up above). If you give it a number that needs more than 8 characters then it will still print all of them.
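A quick way to see the padding is to wrap the formatted value in square brackets. This is just a sketch with arbitrary numbers; the variable names are mine.

```python
value = 3.14159

# Right-aligned in a field that is 8 characters wide, 1 decimal place
right = "{0:> 8.1f}".format(value)

# '0' placed before '>' is an explicit fill character
zeros = "{0:0>8.1f}".format(value)

# A value that needs more than 8 characters is still printed in full
big = "{0:> 8.1f}".format(1234567.891)

# Wrap in square brackets so that the padding is visible
print("[" + right + "]")
print("[" + zeros + "]")
print("[" + big + "]")
```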

Here are some suggestions to better-understand what’s going on:
* Try changing the > to a < to see what happens to the alignment (then change it back)
* Try changing the .1 to a .0 to see what happens to the alignment and precision
* Try changing the .0 to a .6 to see what happens to the alignment and precision
Do this in the coding area below.

In [None]:
print("\tMedian:  {0:???f}".format(series.median()))

## Sorting Data

In [None]:
df.sort_values(by='Group1Pct', ascending=False).head(10)[['GeoLabel','Group1','Group1Pct']]

In [None]:
df.sort_values(by='Group1', ascending=False).head(10)[['GeoLabel','Group1','Group1Pct']]

Just so that you understand what we just did with this:
1. Take the data frame `df`;
2. Sort it in descending order;
3. Take the first ten values;
4. Print out the columns specified by the list.

Let's pull it apart step-by-step at the code level:

* The first step in this process is `df.sort_values` -- you can probably guess what this does: it sorts the data frame!
* The parameters passed to the `sort_values` function are `by`, which is the column on which to sort, and `ascending=False`, which gives us the data frame sorted in _descending_ order!
* The output of `df.sort_values(...)` is a _new_ data frame, which means that we can simply add `.head(10)` to get the first ten rows of the newly-sorted data frame.
* And the output of `df.sort_values(...).head(...)` is yet _another_ data frame, which means that we can print out the values of selected columns using the 'dictionary-like' syntax: we use the outer set of square brackets (`[...]`) to tell pandas that we want to access a subset of the top-10 data frame, and we use the inner set of square brackets (`['GeoLabel','Group1','Group1Pct']`) to tell pandas which columns we want to see.
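If the chaining is hard to follow, it can help to write each step on its own line and keep the intermediate data frames. The sketch below does this on a toy data frame (made-up values, and only the top two rows so that the output stays short), then confirms that the chained version gives the same result.

```python
import pandas as pd

# Toy data (made-up values)
df_demo = pd.DataFrame({
    'GeoLabel':  ['A', 'B', 'C', 'D'],
    'Group1':    [10, 40, 30, 20],
    'Group1Pct': [0.10, 0.20, 0.30, 0.05],
})

# One step at a time, keeping each intermediate data frame
by_pct   = df_demo.sort_values(by='Group1Pct', ascending=False)
top_two  = by_pct.head(2)
selected = top_two[['GeoLabel', 'Group1Pct']]

# The same thing as a single chained expression
chained = df_demo.sort_values(by='Group1Pct', ascending=False).head(2)[['GeoLabel', 'Group1Pct']]

print(selected)
print(selected.equals(chained))
```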

I'd say 'simples, right?' but that's obviously _not_ simple. It _is_, however, very, very _elegant_ because it's quite clear (once you get past the way that lots of methods can be chained together) and it's very succinct (we did all of that in _one_ line of code!).

### Quick Quiz

So, given these results, which do you think is a _better_ answer to the question: where is there the highest concentration of highly-skilled 'NS-SeC Group 1' residents in England and Wales? Was this the answer that you were expecting?

## Outliers

In [None]:
import numpy as np

# Generate a pandas series containing a normal distribution
# of the same size as the LSOA data frame
nDist = pd.Series(np.random.normal(size=df.shape[0]))

# How many of them fall outside the 99.9% CI? (1-tailed?)
outliers = nDist[ nDist > 3.09 ]

print("We selected " + str(outliers.shape[0]) + " of " + str(nDist.shape[0]))
print("\tThat's {0: .4f}%".format(100.0 * outliers.shape[0] / nDist.shape[0]))

In [None]:
# And plot, just to show that it _was_ a normal distribution
nDist.plot.hist(bins=20)

Now let's compare that to what we get with our NS-SeC data:

In [None]:
dfG1 = df[df.Group1ZStd >= 3.09]
dfG4 = df[df.Group4ZStd >= 3.09]

print("We selected " + str(dfG1.shape[0]) + " of " + str(df.shape[0]))
print("\tThat's {0: .4f}%".format(100.0 * dfG1.shape[0] / df.shape[0]))

print("We selected " + str(dfG4.shape[0]) + " of " + str(df.shape[0]))
print("\tThat's {0: .4f}%".format(100.0 * dfG4.shape[0] / df.shape[0]))

So in both cases we got _quite_ a few more 'significantly' over-represented LSOAs than we would have expected if the real distribution had closely matched a random normal one. Without even looking at the plots, we could tell that both of these distributions are positively skewed but with the plots it's super-obvious. So can we use the z-score to make these selections... given that we can tell the mean is not _necessarily_ the best measure of central tendency?
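For reference, this is roughly what 'what we would have expected' means in numbers. The sketch below uses `statistics.NormalDist` from the standard library; `n_areas` is a hypothetical round number chosen purely for illustration, so substitute `df.shape[0]` to compare against your own results.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Share of a standard normal distribution lying above z = 3.09
p_above = 1 - std_normal.cdf(3.09)

# Hypothetical number of areas, purely for illustration
n_areas = 10000
expected = p_above * n_areas

print("Expected share above 3.09: {0:.4f}%".format(p_above * 100))
print("Expected count in {0} areas: {1:.1f}".format(n_areas, expected))
```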

## Statistics as Judgement

There is no simple answer to when the mean is, or is not, a good measure to use in an analysis. We'll spend a lot more time on this over the coming weeks, but the answer _seems_ to be: it depends what you are trying to accomplish.

Let's say that you were a researcher planning to undertake some in-person interviews in areas that have high concentrations of well-off households (Group 1, basically). How would you go about choosing which LSOAs to target? How could we use the z-score to make this process a little less 'random' than picking a couple of areas that you know well and trundling off to find... exactly what you expected because your choice was influenced _not_ by data but by _personal bias_?

Let's look at it this way:
1. Remember that _if_ something followed a normal distribution (2-tailed) then you would _expect_ 95% of values to fall within the range of -1.96 to +1.96 standard deviations from the mean. So if you wanted 5% 'significance' then you could say "I'm going to select everything outside of this range as being _significantly_ over- or under-represented in terms of the Group 1 population at the 5% significance level."
2. If you wanted to select only the _real_ outliers (say, 99.9% 'significance') then you want a z-score like the 3.09 that we used above (that was the 1-tailed threshold; the 2-tailed equivalent is 3.29). If we can see that there are values _way_ beyond that level then we can be pretty certain that there is something non-random happening there... which could be problematic if we thought our model was representing reality but _not_ if we are saying "naively, we expected reality to work like a random normal, so departures from that expectation are interesting." 
3. The point is that we are moving away from just saying "Well, it's more than average so it's interesting". We are actually filtering out a lot of areas that might plausibly have higher (or lower) concentrations **'by chance'** _if a normal distribution is a reasonable representation_. We know it's not, but we're stating here that our _naive_ expectation is that people are distributed at random, so what's interesting is the areas where our results show this obviously is _not_ the case. Do you see what I meant above about subtracting what we _expected_ from what we _observed_?
4. So rather than just saying "25% of Group 1 households is a lot" or "36% of Group 4 households is a lot" we are finding a way to use the z-score to set a threshold that is less influenced by our own ideas about what is 'significant' and more influenced by what the data actually _says_ is significantly outside 'the norm'.
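If you ever need a threshold for a different level, you don't have to look it up in a table. Here is a small sketch using the standard library; `z_threshold` is a helper name that I've made up for this example.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def z_threshold(confidence, tails=2):
    """z-score beyond which values fall outside `confidence` (e.g. 0.95)."""
    alpha = 1 - confidence
    return std_normal.inv_cdf(1 - alpha / tails)

for conf in (0.95, 0.999):
    print("{0:.1%}: 2-tailed {1:.2f}, 1-tailed {2:.2f}".format(
        conf, z_threshold(conf, 2), z_threshold(conf, 1)))
```

Note the difference between the 1-tailed and 2-tailed values: which one you want depends on whether you care about only over-representation or about both over- and under-representation.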

The _only_ problem with this approach in this case is that we can tell that our expectation of normality isn't very accurate. The positive skew of Groups 1 and 4 means that we're getting _too many_ areas being flagged as 'significantly over-represented by Group X'. If we were to pick a better distribution, or if we could 'make' the Group distribution _appear_ more like a normal distribution then the z-score would work better for selecting the really interesting cases...

*[Note: if you started from an assumption that similar people tend to cluster together in space then you'd be creating a distribution that was closer to reality... and then you'd be using spatial statistics! But you **can** use non-spatial assumptions in a spatial context with care. Yeah, it's kind of hard to get your head around.]*

This is why we had the log-transformation:

In [None]:
df.Group1Log.plot.hist(bins=20)

That's still not perfect but it's a lot better than the hugely skewed data we had before! If you _really_ care, then there's a [Box-Cox transform](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html) in `scipy.stats` to get something that is almost _exactly_ normal.
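To see the effect of a log-transform in isolation, here is a sketch on _simulated_ data (a log-normal sample generated with numpy, not the NS-SeC data). It uses the pandas `skew()` method to compare skewness before and after; a value near 0 means roughly symmetric.

```python
import numpy as np
import pandas as pd

# Simulated positively-skewed data (log-normal), NOT the NS-SeC data
rng = np.random.RandomState(42)
raw = pd.Series(rng.lognormal(mean=3, sigma=1, size=10000))

# The log of a log-normal sample should look roughly normal
logged = pd.Series(np.log(raw))

print("Skew before log: {0:.2f}".format(raw.skew()))
print("Skew after log:  {0:.2f}".format(logged.skew()))
```

Real data will rarely behave this neatly, which is why the histogram above is 'a lot better' rather than perfect.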

Now, if we calculate a z-score from the log-transformed data to select areas that fall above the 95% threshold (1-tailed, so a z-score of about 1.65) then we get:

In [None]:
df['Group1LogZStd'] = (df.Group1Log - df.Group1Log.mean()) / df.Group1Log.std()
dfG1LogZ = df[df.Group1LogZStd >= 1.65]

print("We selected " + str(dfG1LogZ.shape[0]) + " of " + str(df.shape[0]))
print("\tThat's {0: .4f}%".format(100.0 * dfG1LogZ.shape[0] / df.shape[0]))

# Visualising Data

If we weren't learning how to program at the same time as we learn to do data analysis then my recommendation would have been this: **start with a chart**. There is _no_ better tool for understanding what is going on in your data than to visualise it, but we couldn't show you how to make a plot without first teaching you how to load data and perform some basic operations on a data frame! Now that we've done *that*, we can get to grips with VDQI (the [Visual Display of Quantitative Information](https://www.edwardtufte.com/tufte/books_vdqi)) and how this supports our understanding of the data.

## Why Seaborn

We've already done some straightforward plotting directly from pandas, but for the data visualisation part of the practical we are going to use the [Seaborn package](http://stanford.edu/~mwaskom/software/seaborn/) because it provides a lot of quite complex functionality (and very pretty pictures) at quite low 'cost' (i.e. effort). There are, however, other options out there that are worth checking out if you take things further; the two that you are most likely to hear mentioned are: [Bokeh](http://bokeh.pydata.org/en/latest/) and matplotlib. 

1. Bokeh is, like Seaborn, designed to make it easy for you to create good-looking plots with minimal effort. 
2. Matplotlib is a different beast: it is actually the _underlying_ library that supports the majority of plotting (drawing graphs) in Python. 

So Seaborn makes use of the matplotlib library to create its plots (Bokeh, by contrast, draws its plots in the web browser using its own JavaScript library), and if you want to customise a Seaborn figure then you will eventually need to get to grips with matplotlib. The reason we don't teach matplotlib directly is that it's much harder to make a good plot and the syntax is much more complex.

A more recent entrant is ŷhat's ggplot library (http://ggplot.yhathq.com), which deliberately mimics R’s ggplot2 -- ggplot2 has become the dominant way of creating plots in the R programming language and it uses a 'visualisation' grammar that many people find incredibly powerful and highly customisable. Unfortunately, ggplot on Python does not currently support mapping (which ggplot2 does in R).

## Loading Seaborn 

As with other libraries that we’ve used, we’ll import Seaborn using an alias:
```python
import seaborn as sns
```
So to access Seaborn's functions we will now always just write `sns.<function name>()` (where `<function name>` would be something like `distplot`). Try importing Seaborn...

In [None]:
import seaborn as sns

If you've not made any changes to your Anaconda distribution then you will probably get:
```
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-43-ed9806ce3570> in <module>()
----> 1 import seaborn as sns

ImportError: No module named seaborn
```
So, what's the error?

To fix it, we'll need to install Seaborn manually. The best way to do this is to open up the Terminal.

[![Installing Packages with Anaconda Python](http://img.youtube.com/vi/P0hA3vMUBuU/0.jpg)](https://www.youtube.com/watch?v=P0hA3vMUBuU)

## Making a Distribution Plot

One of the most useful ways to get a sense of a data series is simply to look at its overall distribution. Something like this:
```python
%matplotlib inline
...
sns.distplot(<data series>)
```
The `%matplotlib inline` command only needs to be run _once_ in a Jupyter notebook; it tells Jupyter to show the plots as part of the web page, rather than trying to show them in a separate window. 

In [None]:
%matplotlib inline
import seaborn as sns

# Use the df['...'] syntax to create a new column (df.g1LQ = ... would not)
df['g1LQ'] = (df.Group1 / df.Total) / (float(df.Group1.sum()) / df.Total.sum())

print(df.g1LQ.describe())

sns.distplot(df.g1LQ)

If all went well you should see a nice distribution plot for the Group 1 Location Quotient! 
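In case the Location Quotient calculation went by too quickly: it divides each area's _local_ share of Group 1 by the share of Group 1 across _all_ areas, so a value above 1 means the group is over-represented locally and a value below 1 means it is under-represented. Here is the same calculation as a sketch on three made-up areas:

```python
import pandas as pd

# Made-up counts for three areas
demo = pd.DataFrame({'Group1': [10, 30, 20], 'Total': [100, 100, 100]})

# Share of Group 1 within each area
local_share = demo.Group1 / demo.Total

# Share of Group 1 across all areas combined
global_share = demo.Group1.sum() / float(demo.Total.sum())

demo['g1LQ'] = local_share / global_share
print(demo)
```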


### Recap

OK, I want you to take a second here: although there was a lot of setup work that needed to be done, we just created a distribution plot in one line of code. One line. This is a more sophisticated plot than you could ever create in Excel and you just created it in one line of code. 

Try producing similar plots for some of the other groups. You should be able to do this using two things:
* A `for` loop to iterate over the column names
* String interpolation to create a new LQ column name

I'll get you started... this also demonstrates how we can use pandas' dictionary-style notation to create a new string that we can then use as a key for a new data series.

In [None]:
for c in ???:
    df[c + 'Lq'] = (df[???] / df['Total']) / (float(df[???].sum()) / df[???].sum())

    print(???.describe())

    sns.distplot(???, hist=False)

If all has gone well, then you should have multiple overlapping figures showing the distribution of each NS-SeC Group in the downloaded data. Do you understand more about the data and its distribution now?

And did you notice how we turned off the histogram (which would have been confusing with 8 overlapping plots) using the `hist=False` option? More on this, as always, using `help(sns.distplot)` or [RTM](https://seaborn.github.io).

## Saving a Plot

Saving a plot isn’t quite as easy as creating one... but wouldn’t it be a lot easier if we could save our plot automatically and not have to even touch a button to do so? This is where we need to use matplotlib syntax (and where you'll see why we opted not to spend too much time on it):
```python
# Simple save

import matplotlib.pyplot as plt # Plotting library needed by seaborn
series = df.Group1Lq # Copy the data series to a new variable
fig = plt.figure(series.name) # Create a figure using a title based on the series name
sns.distplot(series) # Plot the series to the open 'figure' device
fig = plt.gcf() # *G*et the *C*urrent *F*igure environment so that the next command works
plt.savefig("{0}-Test.pdf".format(series.name), bbox_inches="tight")
plt.close() # Close the plot so that nothing else overwrites our work

```
To explain what's happening here: 
1. We import a sub-package of `matplotlib` that gives us access to all kinds of useful functions.
2. We copy the Group1Lq data series to a new variable (this would mean that, to print out Group2 or Group8 we'd only have to change this one line... see: we're preparing to use a `for` loop!)
3. We create a `figure` object into which Seaborn can 'print' its outputs.
4. We call Seaborn and ask it to print (it doesn't really need to know that it's printing to something, it just needs to know what 'device' it should use for output).
5. We save the plot, using the `format` command to replace the `{0}` with the name of the data series (so in this example we'd be saving our figure to `Group1Lq-Test.pdf`).
6. We then close the figure output so that we don't print other plots over the top of it -- that's what happened in the earlier code when we used the `for` loop!
 
The plot should have been saved to your working directory (where the Jupyter notebook is running) using the name of the data series that you were working with. We want to use string replacement (the `{0}`) so that when we save the plot for Group3Lq or Group4Lq we don't overwrite the one for Group1!

### Recap

If you think back a little bit to where we started 6 weeks ago, you’ll see just how far we’ve come, and just how far you can go now: the ability to automatically load, clean, process (e.g. standardise and filter), and print out data in a variety of forms is incredibly powerful. You could repeat an entire analysis simply by changing the input file (assuming the format doesn’t change too much). 

It’s the automation component that sits at the heart of geocomputational techniques -- you still need individuals and their judgement to figure out what to do and what to produce, but once you’ve done that you can make the computer do the boring stuff while you focus on the interpretation and the meaning of the results!

# Automation with Loops & Functions

When we're undertaking an analysis of a data set, we often have to perform the same (or at least similar) tasks for each weather station or socio-economic class or ethnic group. We *could* copy and paste the code, and then just change the variable names to update the analysis... but that would be a definite instance of what Larry Wall would have called 'false laziness': it seems like a time-saving device in the short run, but in the long run you've made your code less readily maintainable (what if you want to _add_ to your analysis or find a bug?) and less easy to understand.

There are nearly always two things that you should look at if you find yourself repeating the same code: 
1. write a for loop; 
2. consider writing a function.

Why these two strategies?

## Automating Summaries

Let’s start with a for loop to generate summary statistics and a distribution plot for several of our data columns – you’ve already typed all of this code more than once so all we need to do is take the right bits of it and put them in a for loop like this:
```python
for series in ...:
    print('Summarising ' + series.name)
    
    print("\tMean:    {0:> 9.2f}".format(series.mean()))
    print("\tMedian:  {0:> 9.2f}".format(series.median()))
```
 
All you need to do is figure out which data series go into the for loop in the first place and then your code will run for each one, producing a nicely formatted summary for each.
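If you're not sure what 'a list of data series' looks like, here is one possible way to build it, sketched on a toy data frame with made-up values (`df_demo` and its columns are purely illustrative). Don't treat it as the only answer: there are several ways to get the series into the loop.

```python
import pandas as pd

# Toy columns (made-up values)
df_demo = pd.DataFrame({'Group1': [10, 40, 30, 20], 'Group2': [5, 15, 25, 35]})

# Build a list of Series, one per column of interest, and loop over it
for series in [df_demo[c] for c in ['Group1', 'Group2']]:
    print('Summarising ' + series.name)
    print("\tMean:    {0:> 9.2f}".format(series.mean()))
    print("\tMedian:  {0:> 9.2f}".format(series.median()))
```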

### Automating Analysis

It is much, much harder to automate the analysis than it is to automate the summary, but there are certain things that we can do to make life easier -- transformation and standardisation are among the easiest things to do as part of automation!

Here’s some code to generate the additional columns for our analysis on the pattern that we created for Group 1:
```python
import numpy as np
for group in ['Group1','Group3',...,'NC']:

    # Log Transform
    df[group+'Log'] = pd.Series(np.log(df[group]))
    
    # Proportional Transform
    df[group+'Pct'] = pd.Series(df[group]/df.Total)
    
    # Location Quotients
    df[group+'Lq'] = pd.Series( df[group+'Pct'] / ( float(df[group].sum()) / df.Total.sum() ) )
    
    # Range Standardisation
    df[group+'RangeStd'] = pd.Series( (df[group] - df[group].min()) / ( float(df[group].max()) - df[group].min() ) )
    
    # Z-Score Standardisation
    df[group+'Z'] = pd.Series( (df[group]-df[group].mean()) / df[group].std() )
```
 
Do you see how that works? We use the group names as keys to access the data in the data frames (using the `df[<column name>]` syntax), but we also use them as strings to create the new column names by doing: `<group name> + typeOfTransformation`.

Make sure that you understand how this works before moving on to the next bit, which is harder.
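The same pattern can also be packaged as a function, which is the second of the two strategies mentioned earlier. The sketch below uses a hypothetical helper (`add_standardised` is my own name) on a toy data frame with made-up values. I'm assuming here that range standardisation means rescaling to the 0 to 1 interval, which is why the minimum is subtracted first.

```python
import pandas as pd

def add_standardised(df, group):
    """Hypothetical helper: add z-score and range-standardised columns."""
    col = df[group]
    # Z-score: distance from the mean in units of standard deviation
    df[group + 'Z'] = (col - col.mean()) / col.std()
    # Range standardisation (assumed here to rescale to 0..1)
    df[group + 'RangeStd'] = (col - col.min()) / float(col.max() - col.min())
    return df

# Toy data (made-up values)
df_demo = pd.DataFrame({'Group1': [10, 40, 30, 20], 'Group2': [5, 15, 25, 35]})

# The group name is both the key for the data and the stem of the new names
for g in ['Group1', 'Group2']:
    df_demo = add_standardised(df_demo, g)

print(df_demo.columns.tolist())
```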

You can check your understanding in the code block below:

In [None]:
df.describe()

### Improving our Automated Summaries

Seaborn is a handy library because it makes a lot of difficult plots quite easy, but to help us make sense of the data it will be useful to add some additional information to our distribution plots: lines to show the location of the mean, median, and outlier thresholds.

To do this, we need to get at the library that Seaborn itself uses: `matplotlib`.

In [None]:
import matplotlib.pyplot as plt

# Setup work -- enables parameterisation
series = df.Group1Lq
fig = plt.figure(series.name)

# Create the plot
d   = sns.distplot(series)

# Now add mean and median
plt.vlines(series.mean(), 0, 1, colors='red', linestyles='dotted', label='Mean')
plt.vlines(series.median(), 0, 1, colors='green', linestyles='dotted', label='Median')

# Add outlier marks
iqr = series.quantile(0.75)-series.quantile(0.25)
if series.quantile(0.25)-1.5*iqr > 0:
    plt.vlines(series.quantile(0.25)-1.5*iqr, d.get_ylim()[0], d.get_ylim()[1], colors='blue', linestyles='dotted', label='Lower Outlier')
if series.quantile(0.75)+1.5*iqr > 0:
    plt.vlines(series.quantile(0.75)+1.5*iqr, d.get_ylim()[0], d.get_ylim()[1], colors='blue', linestyles='dotted', label='Upper Outlier')


In [None]:
help(plt.vlines)

Remember that you type `help(plt.vlines)` to discover what parameters that function takes. 

Try changing the style of the median to a solid green line to make it easier to distinguish from the line marking the mean.

Why do you think I put an `if` condition on the outlier lines? Would this always be appropriate? Why? Why not?
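To help you think about that question, here are the outlier thresholds from the plotting code (1.5 times the inter-quartile range below the first quartile and above the third) calculated on their own. This is a sketch on a made-up series, so the numbers mean nothing beyond illustrating the arithmetic.

```python
import pandas as pd

# Toy series with one obviously extreme value (made-up numbers)
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # values below this are flagged
upper = q3 + 1.5 * iqr   # values above this are flagged

flagged = s[(s < lower) | (s > upper)]
print("Lower: {0:.2f}, Upper: {1:.2f}".format(lower, upper))
print(flagged.tolist())
```

Look at the sign of the lower threshold and ask yourself whether it could ever be reached by data like ours.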

## Other Types of Plots

Using our subset, let’s have a look at some other types of plots...

In [None]:
%matplotlib inline
b = sns.boxplot(x='LA', y='Group1', data=sdf.sort_values(by='Group1'), palette='PRGn', fliersize=3, linewidth=1)
sns.despine(offset=10, trim=True)
for item in b.get_xticklabels():
    item.set_rotation('vertical')
plt.title("Group 1 Distribution")

In [None]:
b = sns.boxplot(x='LA', y='Group1Log', data=sdf.sort_values(by='Group1Log'), palette='PRGn', fliersize=3, linewidth=1)
sns.despine(offset=10, trim=True)
for item in b.get_xticklabels():
    item.set_rotation('vertical')

In [None]:
sns.factorplot(x='LA', y='Group1Z', data=sdf, kind='violin')

In [None]:
sns.regplot(x='Group1Pct', y='Group3Pct', data=sdf)

In [None]:
sns.lmplot(x='Group1', y='Group8', col='LA', data=sdf,
          col_wrap=3, ci=None, palette='muted', size=4, 
          scatter_kws={'s':50, 'alpha': 1})

In [None]:
sns.jointplot(x='Group1', y='Group8', data=sdf, kind='hex', color='#4CB391')

In [None]:
sdf2 = sdf[ ['Group2Lq', 'Group3Lq', 'Group4Lq', 'LA'] ]
sns.pairplot(sdf2, hue='LA', palette='husl').add_legend()

## 3D Plots
Finally, I wanted to show you how to create a 3D scatter plot – in this case the plot doesn’t add a lot to our understanding of the data, but there are cases where it might and it does illustrate how pandas, seaborn, and matplotlib work together to produce some pretty incredible outputs.
  
I would encourage you to look into the options in more detail:
* Can you change the colour map used to indicate which Borough each LSOA is drawn from?
* Can you change the icons used to mark each Borough so that they are different?
* Can you add a legend to indicate which marker is for which Borough?

On with the show!

In [None]:
%matplotlib inline 
# There is a 'notebook' option for matplotlib 
# that should work, but it doesn't seem to do so 
# at the moment.

from mpl_toolkits.mplot3d import Axes3D
import matplotlib.colors as colours 
import matplotlib.cm as cmx

# Set up the figure
w, h = 12, 8
fig = plt.figure(figsize=(w, h))
ax  = fig.add_subplot(111, projection='3d')

# Extract a subset -- same as code above, can 
# skip this step if you've already run the 
# above code.
sdf  = df.loc[df.LA.isin(['Kensington and Chelsea','Hackney','Barking and Dagenham'])]
sdf2 = sdf[ ['Group2Lq', 'Group3Lq', 'Group4Lq', 'LA'] ]

# Set up the 3D axes
x = sdf2.Group2Lq
y = sdf2.Group3Lq
z = sdf2.Group4Lq

# Set up the colourmap so that we see
# different colours for each borough's
# data.
# From: http://stackoverflow.com/questions/28033046/matplotlib-scatter-color-by-categorical-factors
boroughs  = list(set(sdf2.LA)) 
hot       = plt.get_cmap('hot')
cNorm     = colours.Normalize(vmin=0, vmax=len(boroughs))
scalarMap = cmx.ScalarMappable(norm=cNorm, cmap=hot)

for i in range(len(boroughs)):
    indx = sdf2.LA==boroughs[i]
    ax.scatter(x[indx], y[indx], z[indx], c=scalarMap.to_rgba(i), marker='o')

ax.set_xlabel(x.name)
ax.set_ylabel(y.name)
ax.set_zlabel(z.name)

# Additional Resources

Here's the answer to loading the NS-SeC data:

In [None]:
import pandas as pd
df = pd.read_csv('./Data/Data_NSSHRP_UNIT_URESPOP.csv', skiprows=[1])
if 'Unnamed: 15' in df.columns:
    del df['Unnamed: 15'] # If you have this column, this deletes it

# Now just to rename the columns...
colnames = ['CDU','GeoCode','GeoLabel','GeoType','GeoType2','Total']
for i in range(1,9):
    colnames.append('Group' + str(i))
colnames.append('NC')
df.columns = colnames

df.head() # Success! 

In [None]:
# Display floats with 3 decimal places instead of scientific notation
pd.set_option('display.float_format', lambda x: '%.3f' % x)