<div style="text-align:center">
<h1>Transforming Data</h1>
<h2>7SSG2059 Geocomputation 2018/19</h2>
</div>

## This Week’s Overview

This week we're going to explore how standardisation and transformations can aid our data analysis.

## Learning Outcomes

By the end of this practical you should:
- be able to standardise variables to achieve different insights about data 
- understand what z-scores are and how they are related to the standard normal distribution 
- appreciate how data can be transformed, again leading to different insights about data. 

## Statistics as Judgement

We hope that you'll go on to make lots of use of what you're learning here, but the single most important thing that you can take away from the remaining weeks of the class is the idea that statistics is not _truth_, it is _judgement_.

[![Are you above average?](http://img.youtube.com/vi/hQLCWHww9OQ/0.jpg)](http://www.youtube.com/watch?v=hQLCWHww9OQ)

### Data as Representation of Reality

The statistician George Box [once said](https://en.wikipedia.org/wiki/All_models_are_wrong) "all models are wrong but some are useful". Now you might think that 'wrong' _necessarily implies_ uselessness, but this aphorism is a lot more interesting than that: let's review the idea of statistics as the study of a '[data-generating process](http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9639.2012.00524.x/full)'. 

The data that we work with is a _representation_ of reality, it is not reality itself. Just because I can say that the height of human beings is normally distributed with a mean of, say, 170cm and standard deviation of 8cm doesn't mean that I've _explained_ that process. What I have done is to say that reality can be reasonably well approximated by a data-generating process that uses the normal distribution. 

Given that understanding of height distributions, I know that someone with a height of 2m is very, very improbable. Not impossible, just highly unlikely. So if I meet someone that tall then that's quite exciting! But my use of a normal distribution to represent the empirical reality of human height doesn't mean that I think our height is _actually_ distributed randomly using a gigantic lottery system: some parts of the world are typically shorter, other parts typically taller; some families are typically shorter, while others are typically taller...

So the _real_ reason for someone's height is to do with, I assume, a mix of genetics and nutrition. However, across large groups of people it's possible to _represent_ the cumulative impact of those individual realities with a simple normal distribution. And using that simplified data-generating process allows me to do things like estimate the likelihood of meeting someone more than 2m tall (which is why I'd be excited to do so, though not as excited as the guy in the next video...).
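
As a rough illustration of that last point, the sketch below uses Python's standard-library `statistics.NormalDist` with the illustrative figures quoted above (mean 170cm, standard deviation 8cm) to estimate the chance of meeting someone taller than 2m. The variable names are ours, and the parameters are the made-up ones from the text, not measured data.

```python
from statistics import NormalDist

# Illustrative parameters from the text above (not real survey data)
height = NormalDist(mu=170, sigma=8)

# Probability of a height above 200cm = 1 - P(height <= 200)
p_taller_than_2m = 1 - height.cdf(200)

print(f"P(height > 2m) is roughly {p_taller_than_2m:.6f}")
```

The result is small but not zero, which is exactly the 'not impossible, just highly unlikely' idea described above.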

Here's a (genuinely terrifying) video that tries to explain this whole idea in a different way:

[![From reality to make-believe](http://img.youtube.com/vi/HAfI0g_S9oo/0.jpg)](https://www.youtube.com/watch?v=HAfI0g_S9oo)

In the same way, different parts of a city may have different characteristics in terms of their physical structure, ethnic composition, affluence, etc. Because of wide variations in underlying data-generating processes, different areas of a city may be similar or different to one another across multiple characteristics. But by comparing between locations and examining patterns (including the distribution of characteristics and looking at the extremes) we may be able to begin to think about what the underlying processes are.

Let's start with the usual bits of code to ensure plotting works, to import packages and load the data into memory (with a quick check that it loaded properly):

In [None]:
import matplotlib as mpl
mpl.use('TkAgg')
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
import pandas as pd
import seaborn as sns

This week, we'll use the `usecols` argument of the pandas `read_csv` function to read only the columns we'll use in this week's analysis. You may want to learn more about the `usecols` argument online, but at least check you can see roughly what it is doing in the code below.

In [None]:
dfs = pd.read_csv(
    'https://github.com/kingsgeocomp/geocomputation/blob/master/data/LSOA%20Data.csv.gz?raw=true',
    compression='gzip', 
    low_memory=False, 
    usecols = ['LSOA11NM','USUALRES','HHOLDRES','COMESTRES','POPDEN','HHOLDS','MedianIncome','GreenspaceArea','RoadsArea','Owned','White','Area']) 
dfs.columns
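
If you want to see what `usecols` does without downloading anything, here is a minimal sketch that reads a tiny in-memory CSV instead of the real file. The column names are borrowed from the LSOA data, but the rows (and the `Unwanted` column) are invented for illustration.

```python
import io
import pandas as pd

# A tiny made-up CSV standing in for the real LSOA file
csv_text = """LSOA11NM,HHOLDS,Owned,Unwanted
Area A,600,300,x
Area B,800,200,y
"""

# Only the columns named in usecols are parsed; 'Unwanted' is skipped
small = pd.read_csv(io.StringIO(csv_text), usecols=['LSOA11NM', 'HHOLDS', 'Owned'])

print(small.columns.tolist())
```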

We'll also run a new bit of code to suppress the warnings we've been getting recently. The code ensures warnings are not displayed, making results easier to read, but beware in future that hiding these messages may lead us to miss useful information...

In [None]:
import warnings 
warnings.simplefilter('ignore')
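
If hiding every warning for the rest of the session feels too blunt, one alternative (our suggestion, not something this practical requires) is to silence warnings only around a specific block of code using `warnings.catch_warnings()`. In this sketch `noisy_function` is a hypothetical stand-in for any library call that emits a warning.

```python
import warnings

def noisy_function():
    # Hypothetical stand-in for a library call that emits a warning
    warnings.warn("this feature is deprecated", FutureWarning)
    return 42

# Warnings are ignored only inside the 'with' block;
# the previous filter settings are restored afterwards
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    result = noisy_function()

print(result)
```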

# Standardisation

Standardisation can make comparisons between observations clearer, more fair or lead to new insights. It may also be able to help us understand what is happening at the extremes of our distributions of data. There are several different types of standardisation we can apply to data. Here we'll consider three:
1. Proportional 
2. Areal  (density)
3. Z-score

The last of these is closely tied to the idea of the standard normal distribution, which we'll also get into in some detail.

### Proportional Standardisation

Let's assume we have questions about house ownership in London, and how it varies across space. One thing we might want to do is compare LSOAs for how 'much' ownership they have. But would comparing the total number of households that own their property between LSOAs be fair given that LSOAs have different numbers of households overall? For example, let's look at the distribution of total households across the LSOAs.

#### Task 1
Find the maximum and minimum LSOA `HHOLDS` across London (Hint: use the `describe` method on a pandas `Series`)

#### Task 2
Create a seaborn `distplot` to illustrate the distribution of LSOA total households.

There's quite a lot of variation there; for example, the LSOA with most households has more than three times as many as the LSOA with the fewest (check you can see this from your result for Task 1). Comparing the total number of owned households between LSOAs of varying size may not give us a very good indication of how 'much' ownership there is in an area (although this depends on your research questions).

Another way would be to compare LSOAs using a _standardised_ measure of ownership. One of the most straight-forward ways to do this is to look at the _proportion_ of households that are owned within each LSOA. To see the difference between using unstandardised vs standardised data, let's compare the top five LSOAs by total number of owned households to the top five LSOAs by proportion of ownership.

First, the total number of owned households

In [None]:
print(dfs.sort_values(by='Owned', ascending=False).head(5).round(3)[['LSOA11NM','Owned','HHOLDS']])

Now, to standardise by proportions. Let's first calculate a new `Series` in our `DataFrame` to hold the proportions, then do the same sorting as above to find the top LSOAs by proportion: 

In [None]:
dfs['Owned_prop'] = dfs.Owned / dfs.HHOLDS

print(dfs.sort_values(by='Owned_prop', ascending=False).head(5).round(3)[['LSOA11NM','Owned_prop','HHOLDS']])

Compare the last two outputs. Hopefully you can see there are clear differences in the LSOAs shown. Check you understand why this is. 
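
To see why the two rankings can differ, here is the same comparison on three made-up areas (the names and counts are invented for illustration). Sorting by the raw count and sorting by the proportion pick out different 'top' areas.

```python
import pandas as pd

# Three hypothetical areas (values invented for illustration)
toy = pd.DataFrame({
    'name':   ['Area A', 'Area B', 'Area C'],
    'Owned':  [400, 300, 250],
    'HHOLDS': [1000, 400, 500],
})

# Proportional standardisation: owned households / all households
toy['Owned_prop'] = toy['Owned'] / toy['HHOLDS']

top_by_count = toy.sort_values(by='Owned', ascending=False).iloc[0]['name']
top_by_prop  = toy.sort_values(by='Owned_prop', ascending=False).iloc[0]['name']

print(top_by_count, top_by_prop)
```

Area A has the most owned households simply because it is the largest area; Area B has fewer owned households but a much higher rate of ownership.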

What do you think the different rates of ownership might mean for how 'affluent' we think an area is? Is house ownership even a good indication of 'affluence'? 

**Task 3** 

To consider this last question, print the _bottom_ 5 LSOAs in terms of ownership proportion and see which parts of London they are in:  

Are they areas you expected? Maybe there isn't a direct link between house ownership and 'affluence' of a neighbourhood. Think about why that might be in a 'global city' like London.

What we've been doing above is _proportional standardisation_. 

A proportion (e.g. a percentage) of the total count contained in one LSOA is one way of standardising data since, unless you're measuring change, it limits the range to between 0% and 100%. Programmers and statisticians almost always write a proportion in a decimal format so the range is between 0.0 and 1.0. 

Mathematically, however, the notation is a little more forbidding:
$$
p_{i} = \frac{x_{i}}{\sum_{i=1}^{n} x_{i}}
$$
However, it's important that you begin to familiarise yourself with the mathematical notation since many papers on computational geography will make use of this form. Let's break it down:

1. It's a fraction: the observation _x_ divided by, errrr, something involving x
2. The numerator is easy
3. The denominator is hard
4. The key is in the $i=1$ and $n$, which tells us that the sum is for 'all i' (i.e. summing for every _x_-value in the data set) 

In our case the sum was the total number of households in a LSOA. 

But the point is that the mathematical notation applies for every data set, not just our LSOA data. That's where the mathematical formula is more useful – it's not linked to the specifics of this particular data set. Hopefully you can see where the formula is applied in the code above. 
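
One way to read the formula is with each _x_ being the number of households in one tenure category of a single LSOA, so that the sum over all _i_ is the total number of households in that LSOA. The sketch below works through that reading in plain Python; the category names and counts are invented.

```python
# Household counts by tenure for one hypothetical LSOA (invented numbers)
x = {'Owned': 300, 'Social rented': 150, 'Private rented': 150}

# Denominator: the sum over all i of x_i (total households in the LSOA)
total = sum(x.values())

# p_i = x_i / total, for every category i
p = {tenure: count / total for tenure, count in x.items()}

print(p)
```

Because every category is divided by the same total, the proportions always add up to 1.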

### Areal Standardisation

Areal standardisation is very similar to proportional, but applied to units of area. 

In this case, let's assume we have questions about how much greenspace there is in London, and how it varies across space. One thing we might want to do is compare LSOAs for how much greenspace area they have. But would comparing absolute greenspace area between LSOAs be fair given that LSOAs have different total areas? For example, let's look at the distribution of total _Area_ for LSOAs.

In [None]:
sns.distplot(dfs['Area'])

There seems to be even more variation between LSOAs in their total area than in the total number of households we saw above! The maximum area is several [orders of magnitude](https://www.khanacademy.org/math/pre-algebra/pre-algebra-exponents-radicals/pre-algebra-orders-of-magnitude/v/orders-of-magnitude-exercise-example-2) greater than the minimum. 

#### Task 4
Check the details of the distribution of the _Area_ variable for yourself by creating descriptive statistics for the _Area_ `Series`:   

So, similar to households, it might be more appropriate (depending on your research question) to compare standardised measures of greenspace area. This time we could look at LSOA greenspace area as a proportion of total area (a form of _areal standardisation_). As above, let's compare unstandardised vs standardised measures of greenspace.

First, by absolute area:

In [None]:
print(dfs.sort_values(by='GreenspaceArea', ascending=False).head(5).round(0)[['LSOA11NM','GreenspaceArea', 'Area']])

Now, to standardise by proportions. Let's first calculate a new `Series` in our `DataFrame` to hold the proportions, then do the same sorting as above to find the top LSOAs by proportion: 

In [None]:
dfs['GA_prop'] = dfs.GreenspaceArea / dfs.Area

print(dfs.sort_values(by='GA_prop', ascending=False).head(5).round(3)[['LSOA11NM','GreenspaceArea','Area','GA_prop']])

Compare the last two outputs. Hopefully you can see that while _Richmond upon Thames 012B_ is top in both cases, the next largest LSOAs are different. Check you understand why this is. 

In this case _Richmond upon Thames 012B_ is clearly the 'greenest' in both absolute and relative terms. But which of the other LSOAs listed do you think is 'greener'? It probably depends on how you want to define 'greener'... (which in turn depends on what other aspects of geography you're interested in) 

## Exercise:

Re-use the code above to compare the following variables between LSOAs:
1. total road area and road area standardised (areal) using a sensible variable 
2. total number of white people and white people standardised (proportion) by a sensible population variable 

In each case, examine the **bottom 10** LSOAs. 

In future weeks we'll see how we can map these proportions, like in [Lansley (2016)](https://doi.org/10.1080/21681376.2016.1177466) discussed in lecture.  

## Z-Score Standardisation

The z-score is another common type of standardisation. In some ways it's more complicated than a simple proportion, but in other ways it is simpler (given that it involves only a single variable).

Both proportions and z-scores are designed to enable us to compare data across different groups of observations. We can easily compare two percentages to know which one is more, and which less (e.g. I got 80% on one exam and 70% on the other). But let's think about a slightly different question: _which exam did I do better on relative to other students?_ What if you got 80% on an exam where everyone else got 85%? Suddenly that doesn't look quite so good, right? And what if your 70% was on an exam where the average mark was 50%? Suddenly that looks a lot better, right?

The z-score is designed to help you perform this comparison directly.

As a reminder, the z-score looks like this:
$$
z=(x-\mu)/\sigma
$$

That's: `(<data> - <mean>)/<standard deviation>`
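
Here is the exam example from above as a small worked sketch. The marks and class means come from the text, but the text doesn't give a standard deviation, so the value of 10 marks used for both exams is purely an assumption; `z_score` is just our name for the helper.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies from the mean."""
    return (x - mean) / sd

# Marks and class means from the example above; the standard
# deviation of 10 marks is an assumption made for illustration
exam_1 = z_score(80, mean=85, sd=10)
exam_2 = z_score(70, mean=50, sd=10)

print(exam_1, exam_2)
```

Under that assumption the 80% is half a standard deviation _below_ the class mean, while the 70% is two standard deviations _above_ it.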

### Using the Z-Score

Let's start to bring the idea of the 'data generating processes' to life. The first thing to do with the z-score is to look at what it implies:

1. Subtracting the mean implies that the mean _is a useful measure of centrality_: in other words, the only reason to subtract the mean is if the mean is _meaningful_. If you are dealing with highly skewed data then the mean is not going to be very useful and, by implication, neither is the z-score.
2. Dividing by the Standard Deviation implies that this _is a useful measure of distribution_: again, if the data is heavily skewed or follows some exponential distribution then the standard deviation is not going to be very useful as an analytical tool.

So the z-score is _most_ relevant when we are dealing with something that looks vaguely like a normal distribution (once standardised, it resembles the _standard_ normal distribution, which has mean=0 and standard deviation=1). In those cases, any observation whose z-score is more than 1.96 standard deviations from the mean (in either direction) is in the 5% significance zone.
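
If you'd like to check where the 1.96 and the 5% come from, this short sketch uses the standard library's `statistics.NormalDist` (our choice of tool; any normal-distribution table gives the same answer).

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Share of a standard normal lying beyond +/-1.96 (both tails together)
two_tailed = 2 * (1 - std_normal.cdf(1.96))

# Going the other way: the z-value that leaves 2.5% in the upper tail
cutoff = std_normal.inv_cdf(0.975)

print(round(two_tailed, 4), round(cutoff, 2))
```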

But remember: we can't really say _why_ one particular area has a high concentration of well-off individuals or why one individual is over 2m tall. All we are saying is that this standardised value is a pretty unlikely outcome _when compared to our expectation that people are randomly distributed across the region_ or _that people have randomly-distributed heights of mean 'x' and standard deviation 'y'_. 

Of course, we _know_ that people aren't randomly distributed around the country in the same way that we know that height isn't genuinely random because of the influence of genetics, nutrition, etc. But we need a way to pick out what counts as _**significant**_ over- or under-concentration (or height) from the wide range of 'a bit more' or 'a bit less' than 'normal'. **_If_** a normal distribution does a good job of representing the overall distribution of heights (_whatever_ the reason) then someone of 2m is highly unlikely but we can't say _how_ unlikely until we've placed them on the distribution.

Let's put it another way:
* Is 10% of wealthy individuals in a small area a high concentration?
* How about 20%?
* 30%?

The only way to answer that question is to use something like the z-score since it standardises all of the values _against the average_. If wealthy people were distributed at random then we would _expect_ that most areas would have about the average concentration. Some areas will have a few more. Some areas a few less. But according to the way that the standard normal distribution works, _nowhere_ should have a z-score of 10. Or 20, since that is 20 standard deviations from the mean and just shouldn't exist in the lifetime of the universe. So if we see that kind of result then we know two things:

1. That our assumption that a normal distribution is a reasonable representation of reality breaks down at some point.
2. That there are _some_ areas with _highly significant_ over- or under-representation by wealthy residents.

But that's ok, because we're trying to set an expectation of what we think we'll see so that we can pick out the significant outliers.
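
To get a feel for just how improbable a z-score of 10 or 20 would be under a standard normal distribution, the sketch below computes upper-tail probabilities. It uses `math.erfc` rather than `1 - cdf` because, for values this extreme, the subtraction would simply round to zero; `upper_tail` is our own helper name.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

for z in (1.96, 5, 10, 20):
    print(z, upper_tail(z))
```

The probabilities shrink extraordinarily quickly, which is why seeing such a value in real data tells us more about the limits of the normal assumption than about chance.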

### Setting Expectations

And here we get to the crux of the issue. Most frequentist statistics boils down to this: subtracting **what you expected** from **what you got**, and then dividing by **some measure of spread** to control for the range of the data. We then look at what's _left_ to see if we think the gap between expectations and observations is _meaningful_ or whether it falls within the range of 'random noise'.

It should be obvious that it's the _**expected**_ part of that equation that is absolutely crucial: we come up with a process that generates data that _represents_ the most important aspects of reality, but we shouldn't imagine that our process has explained them. It's the first step, not the last.

### Z-Score Standardisation in Pandas

Let’s take our ownership column and calculate a new series called `OwnedZStd`!

One way to do this would be in two stages: subtract the mean to create a new series, then divide by the standard deviation into another new series (_note:_ `<column>` is just a generic name; e.g. substitute 'Owned' for column to create actual code):
```python
dfs['<column>LessMean'] = dfs.<column> - dfs.<column>.mean()
dfs['<column>ZStd']     = dfs.<column>LessMean / dfs.<column>.std()
```

So for our ownership variable:

In [None]:
dfs['OwnedLessMean'] = dfs['Owned'] - dfs['Owned'].mean()
dfs['OwnedZStd'] = dfs['OwnedLessMean'] / dfs['Owned'].std()

And now print to check:

In [None]:
print(dfs.head()[['LSOA11NM','Owned','OwnedLessMean','OwnedZStd']])

Let's be clear: the code is _perfectly acceptable_ and _perfectly accurate_. It is also, however, not as elegant as it might be since it creates extra columns of data that are really only temporary variables that don't need to be added to the data frame.

By now you've seen how we can 'chain' together method calls, so you should know that we could _also_ do it this way:
```python
dfs['<column>ZStd'] = (dfs.<column> - dfs.<column>.mean()) / dfs.<column>.std()
```
And for our ownership data:

In [None]:
dfs['OwnedZStd-again'] = (dfs['Owned'] - dfs['Owned'].mean()) / dfs['Owned'].std()

In [None]:
print(dfs.head()[['LSOA11NM','Owned','OwnedLessMean','OwnedZStd','OwnedZStd-again']])

Do you see how we can begin to build increasingly complicated equations into the process of creating a new data series?
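
If you find yourself standardising several columns, one option is to wrap the calculation in a small function. This is only a sketch: `z_standardise` is our own (hypothetical) helper name and the values are made up rather than taken from the LSOA data.

```python
import pandas as pd

def z_standardise(series):
    """Return (series - mean) / standard deviation as a new Series."""
    return (series - series.mean()) / series.std()

# Made-up values standing in for a column such as 'Owned'
toy = pd.Series([120, 250, 310, 400, 520], name='Owned')
toy_z = z_standardise(toy)

# After standardising, the mean is (numerically) 0 and the std is 1
print(round(toy_z.mean(), 10), round(toy_z.std(), 10))
```

Note that pandas' `.std()` uses the sample standard deviation (dividing by n-1) by default, so the result can differ very slightly from tools that divide by n.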

To check you understand what the Z score (standardisation) is doing, let's plot histograms for the original data, the ZStd column and the proportion column we made earlier. 

In [None]:
cols = ['Owned','OwnedZStd','Owned_prop']

for c in cols:
    plt.figure()
    sns.distplot(dfs[c], kde=False)

Note how _Owned_ and _OwnedZStd_ have identical shapes (i.e. bars have identical heights), but we have changed the x-axis so that 0 is centred on the mean value for Owned across all LSOAs. In contrast, the *Owned_prop* has changed the shape of the plot _and_ the values on the x-axis. Thus, the proportion standardisation is creating a new variable, whereas the Z-score standardisation is just shifting and rescaling the original variable.
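
You can also check the 'same shape' claim numerically rather than by eye. In this sketch (made-up, deliberately skewed values) the skewness and the ordering of the observations are the same before and after z-score standardisation.

```python
import pandas as pd

# Made-up, deliberately skewed values (not the LSOA data)
x = pd.Series([1, 2, 2, 3, 3, 3, 4, 10, 25])
z = (x - x.mean()) / x.std()

# Skewness describes shape; it is unaffected by shifting and rescaling
print(round(x.skew(), 6), round(z.skew(), 6))

# The ordering of observations is also unchanged
same_order = (x.rank() == z.rank()).all()
print(same_order)
```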

Z-scores are useful because they have properties that we know about:
- a value of 0 is equal to the mean of the original variable
- values >0 are greater than the mean
- values <0 are less than the mean
- each unit of z is equal to one standard deviation of the original variable

And we can check some of these by printing the 'top' five LSOAs for the Owned variable again:

In [None]:
print(dfs.sort_values(by='Owned', ascending=False).head(5).round(3)[['LSOA11NM','Owned','OwnedZStd']])

We can see these LSOAs all have z-scores greater than 0 and they descend in value at the same rate as the Owned variable. Most importantly we can conclude from these data that these LSOAs are all several standard deviations greater than the mean, something we wouldn't know by just looking at the values of the Owned variable itself (check you understand why this is!).

As succinctly put [here](http://colingorrie.github.io/outlier-detection.html), z-scores are:

> a way of describing a data point in terms of its relationship to the mean and standard deviation of a group of points. Taking a Z-score is simply mapping the data onto a distribution whose mean is defined as 0 and whose standard deviation is defined as 1. The goal of taking Z-scores is to remove the effects of the location and scale of the data, allowing different datasets to be compared directly.

### Exercise

Using the code above as a template, create a _GreenspaceAreaZStd_ variable in the `dfs` `DataFrame` then plot its distribution compared to the original _GreenspaceArea_ variable.

## The Normal Distribution

Z-scores are often associated with the normal distribution because their interpretation often implicitly assumes a normal distribution. 

Or to put it another way...

You can always calculate z-scores for your data (it's just a formula applied to data points), but their _intuitive meaning_ is lost if your data don't have something like a normal distribution (or follow the [68-95-99.7 rule](https://en.wikipedia.org/wiki/68–95–99.7_rule)).

What is this 'intuitive meaning'? Let's re-visit the properties of the normal distribution to see if we can find it...

#### The Standard Normal

The standard normal distribution has some useful properties:

![Standard Normal](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/Boxplot_vs_PDF.svg/500px-Boxplot_vs_PDF.svg.png)

Note that for a perfect normal distribution, we know how much of the data falls within each standard deviation from the mean.  
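
Those shares can be reproduced in a few lines. The sketch below (again using the standard library's `statistics.NormalDist`, which is our choice rather than part of the practical) computes how much of a perfect normal distribution lies within 1, 2 and 3 standard deviations of the mean.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Share of a perfect normal distribution within k standard deviations of the mean
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in within.items():
    print(f"within {k} sd: {share:.4f}")
```

These are the numbers behind the 68-95-99.7 rule.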

Note also how the figures above show the distribution in terms of a mean of zero and units of standard deviation; these are exactly the properties of z-scores! So hopefully you can see how knowing the z-score for an observation from a normally distributed variable (e.g. an attribute of an LSOA) is useful to understand where it falls in the distribution (regardless of the absolute value of the observation).

But... what if our data are non-normal? Well, just because data are non-normal doesn't mean z-scores can't be calculated; we just have to be careful what we do with them... and sometimes we should just avoid them entirely.

Let's look at our data to check we understand when Z-scores are useful (i.e. when our data are 'normal enough'). There are formal tests for establishing the probability that your data are normally distributed, but there are dangers to using these tests with large data sets. So we'll use another approach in which we plot our data with a theoretical normal distribution based on the mean and standard deviation of our data. 

Below is a function to create that theoretical normal distribution. See if you can understand what's going on and add comments to the code to explain what each line does.

In [None]:
import numpy as np  

def normal_from_dist(series): 
    mu = series.mean()         
    sd = series.std()          
    n  = len(series)           
    s = np.random.normal(mu, sd, n)   
    return s                   

To make it easier to understand what the function above is doing, let's use it! We'll draw a distribution plot (histogram and KDE) for our data, and then overplot a _second_ distplot on the same figure showing the theoretical normal distribution (in red). We'll do this in a loop for each of the three variables we want to examine.

**From the output, which of the three variables has a roughly normal distribution?** Another way to think about this question is, for which of the variables are the mean and standard deviation _most_ appropriate as measures of centrality and spread? 

In [None]:
for c in ['GreenspaceArea','MedianIncome','Owned']:
    fig1 = plt.figure()
    ax1 = fig1.add_subplot(111)
    sns.distplot(dfs[c]) 
    sns.distplot(normal_from_dist(dfs[c]), hist=False, color='red') 

From the output, hopefully you can see that:

- _GreenspaceArea_ has a highly non-normal distribution; we shouldn't try to identify any outliers in this variable with the data like this
- _MedianIncome_ is close to normal but it is right skewed 
- _Owned_ has the 'most normal' distribution - we can see this because this is the variable for which the red and blue lines most closely align. 

We might be tempted to assume the _MedianIncome_ data are close enough to normal, but in the Transformation section below we'll examine whether we could make them even more normal.

This approach may seem quite rough, but remember what we said right at the outset of this notebook; statistics is not _truth_, it is _judgement_.
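
If you want a number to sit alongside the visual judgement, one rough check (our suggestion, not a formal test) is to ask what share of the observations lie within one standard deviation of the mean; for normal data it should be about 68%. The sketch below tries this on two synthetic samples, and `share_within_one_sd` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

def share_within_one_sd(values):
    """Fraction of values lying within one standard deviation of the mean."""
    values = np.asarray(values)
    mu, sd = values.mean(), values.std()
    return np.mean(np.abs(values - mu) <= sd)

# Two synthetic samples: one normal, one strongly right-skewed
normal_sample = rng.normal(loc=50, scale=10, size=10_000)
skewed_sample = rng.lognormal(mean=0, sigma=1, size=10_000)

normal_share = share_within_one_sd(normal_sample)
skewed_share = share_within_one_sd(skewed_sample)
print(normal_share, skewed_share)
```

A share well away from 68% is a hint that the mean and standard deviation are not summarising the data very well, but it remains a judgement call.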

### Exercise

Examine the _HHOLDS_, _COMESTRES_ and _POPDEN_ variables to check their distribution. Re-use code from above to plot their distributions with a standard theoretical normal distribution. For any that you think are roughly normal, calculate Z-scores. For _all three_ variables report mean and median and state which you think is most appropriate to use as a measure of central tendency.  

----

# Transformation

Transformations are useful when a data series has features that make comparisons or analysis difficult, or that affect our ability to intuit meaningful difference. By manipulating the data using one or more mathematical operations we can sometimes make it more *tractable* for subsequent analysis. In other words, it's all about the _context_ of our data.

[![How tall is tall?](http://img.youtube.com/vi/-VjcCr4uM6w/0.jpg)](http://www.youtube.com/watch?v=-VjcCr4uM6w)

To be clear from the outset, normalisation isn't _just_ about trying to make our data 'more normal' so we can interpret z-scores. It's also really useful when we want to use certain statistical models (like linear regression), as we'll see later in the module. 

From above, we know the _Owned_ variable is pretty normally distributed; not perfect as we discovered above, but good enough. But let's see if we can make the _MedianIncome_ data more normally distributed. The first step is to try to anticipate what distribution _MedianIncome_ currently has. This can be done by comparing the shape of the histogram to the shapes of theoretical distributions. For example:
- the [log-normal](https://en.wikipedia.org/wiki/Log-normal_distribution) distribution
- the [exponential](https://en.wikipedia.org/wiki/Exponential_distribution) distribution
- the [Poisson](https://en.wikipedia.org/wiki/Poisson_distribution) distribution (for non-continuous data)
 
From looking at those theoretical distributions, we might make an initial guess that the _MedianIncome_ data have a log-normal distribution. In this case, making a log-transform of the data might make it more normal. 

## Logarithmic Transformation

Logarithmic transformations are also considered fairly simple, but they are _non_-linear transformations and so they do change the relationships in your data in important ways. Although you _could_ use any logarithm, the natural log is considered the most useful since both the mean and standard deviation retain _some_ meaning (though you probably wouldn't report these as such). If you don’t remember what a logarithm is try these:

[![From reality to make-believe](http://img.youtube.com/vi/zzu2POfYv0Y/0.jpg)](https://www.youtube.com/watch?v=zzu2POfYv0Y)

[![From reality to make-believe](http://img.youtube.com/vi/akXXXx2ahW0/0.jpg)](https://www.youtube.com/watch?v=akXXXx2ahW0)

[![From reality to make-believe](http://img.youtube.com/vi/0fKBhvDjuy0/0.jpg)](https://www.youtube.com/watch?v=0fKBhvDjuy0)

The last video was made by Ray & Charles Eames, two of the 20th Century’s most famous designers.

### Logarithmic Transforms in Pandas

To create a new series in the data frame containing the natural log of the original value it’s a similar process to what we've done before, but since pandas doesn't provide a log-transform operator (i.e. you can’t call `df['MedianIncome'].log()` ) we need to use the `numpy` package:
```python
series = pd.Series(np.log(dfs['MedianIncome']))
```
Let's perform the transform then compare to the un-transformed data. Comment the code below to ensure that you understand what it is doing. 

In [None]:
import numpy as np  

dfs['logMedianIncome'] = pd.Series(np.log(dfs.MedianIncome)) 

print(dfs.describe().round(1)[['MedianIncome','logMedianIncome']])  

cols = ['MedianIncome','logMedianIncome']  
for m in cols:                           
    series = dfs[m]
    plt.figure()
    sns.distplot(series)
    sns.distplot(normal_from_dist(series), hist=False, color='red') 

Hopefully, you can see that the transformed data do indeed look 'more normal'; the peaks of the red and blue lines are closer together and the blue line at the lower extreme is also closer to the red line.
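
The same effect can be seen numerically on synthetic data. In this sketch we generate log-normally distributed values (so they are _not_ the LSOA incomes) and compare the skewness before and after taking the natural log; the parameters are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic, log-normally distributed 'incomes' (not the LSOA data)
income = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=5_000))
log_income = np.log(income)

# Skewness near 0 suggests a roughly symmetric distribution
raw_skew = income.skew()
log_skew = log_income.skew()
print(round(raw_skew, 2), round(log_skew, 2))
```

Real data will rarely be this well behaved, which is why we still plot the result and use our judgement.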

Let's see what that has done to the z-scores.

## Exercise

Calculate z-scores for the _MedianIncome_ data and the log-transformed _MedianIncome_ data, and print out the top and bottom 10 observations for _MedianIncome_ and compare the z-scores. Check you can see how the lower extreme z-scores for the log-transformed data reflect the histogram above. 

Now let's look at the **Greenspace Area** data. These data were even further from a normal distribution than the _MedianIncome_ data and they don't really look very log-normal but sometimes it's good just to start with what you know. So let's try a log transform first and then plot the distribution:

In [None]:
dfs['logGSA'] = pd.Series(np.log(dfs['GreenspaceArea'])) 
sns.distplot(dfs['logGSA'])

Did you get an error? Can you work out why that is....?

Okay, so the reason is that there are many LSOAs with 0 GreenspaceArea. As the log of 0 is undefined (numpy returns `-inf`), the new data series we created above contains some strange values:

In [None]:
print(dfs['logGSA'])

See all those `-inf`? Seaborn just doesn't know what to do with them!

We frequently have zeros in our data so this sort of thing happens a lot in data analysis. A common solution in this case is to add a small value to _every_ observation (shifting the scale) such that there are no zeros in the data. _Then_ we can make a log transform with no problem. To prove this:

In [None]:
dfs['logGSA'] = pd.Series(np.log(dfs['GreenspaceArea']+0.1)) 
print(dfs['logGSA'])

Hopefully you can see the small edit I made to the previous code. _Now_ we don't have strange values and Seaborn should be a little happier:

In [None]:
sns.distplot(dfs['logGSA'])

Success!
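
Here is the whole problem and fix in miniature, using a handful of made-up areas. The sketch also shows `np.log1p`, which computes log(1 + x) and is a commonly used alternative to adding a constant by hand; it is offered as an option, not as what this practical uses.

```python
import numpy as np

# Made-up greenspace areas, including a zero
area = np.array([0.0, 5.0, 120.0, 3000.0])

# np.log(0) evaluates to -inf (numpy also emits a RuntimeWarning,
# silenced here so the output stays readable)
with np.errstate(divide='ignore'):
    raw_log = np.log(area)

# Shifting every value by a small constant, as in the text
shifted_log = np.log(area + 0.1)

# np.log1p(x) computes log(1 + x): a commonly used alternative shift
log1p_values = np.log1p(area)

print(raw_log)
print(shifted_log)
print(log1p_values)
```

The size of the constant is itself a judgement call: it determines how far the former zeros sit from the rest of the transformed values.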

But now this plot _is_ interesting: the figure shows what seems to be two quite different things going on in our data! We've obviously got the LSOAs that contain _no_ greenspace, but then we've got something else that is _much_ closer to 'normal' (though not properly normal). 

In this case we should maybe start thinking about splitting this variable into two and thinking about the LSOAs in each group; how and why are LSOAs with zero greenspace different from those with some greenspace? One thing we might do for example is map the location of the different groups of LSOAs (but we'll get to that later). 

For now, let's think about how we would split the data to plot a histogram.
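
As a starting point, here is a sketch of splitting a column into 'zero' and 'non-zero' groups with a boolean mask. The values are made up; in the practical you would apply the same idea to `dfs['GreenspaceArea']`.

```python
import pandas as pd

# Made-up values; in the practical this would be dfs['GreenspaceArea']
toy = pd.DataFrame({'GreenspaceArea': [0.0, 0.0, 15.0, 240.0, 1300.0]})

# A boolean mask is True for rows with some greenspace
has_green = toy['GreenspaceArea'] > 0

some_green = toy[has_green]    # rows where the mask is True
no_green   = toy[~has_green]   # '~' inverts the mask

print(len(some_green), len(no_green))
```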

## Exercise

Plot a histogram of the log-transformed greenspace area for LSOAs with some greenspace, and think about what the plot shows: 
- how are the data skewed?
- what are the mean and median?

What about if we looked at the distribution of greenspace area _proportion_?

These are the sorts of questions we should be raising in our minds as we go through the data analysis process (of course, guided by our research questions...)

## Credits!

#### Contributors:
The following individuals have contributed to these teaching materials: Jon Reades (jonathan.reades@kcl.ac.uk), James Millington (james.millington@kcl.ac.uk)

#### License
These teaching materials are licensed under a mix of [The MIT License](https://opensource.org/licenses/mit-license.php) and the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/).

#### Acknowledgements:
Supported by the [Royal Geographical Society](https://www.rgs.org/HomePage.htm) (with the Institute of British Geographers) with a Ray Y Gildea Jr Award.

#### Potential Dependencies:
This notebook may depend on the following libraries: pandas, numpy, matplotlib, seaborn