<center><h1>7SSG2059 Geocomputation 2018/19</h1></center>

<h1><center>Practical 7: Data Manipulation</center></h1>



# Manipulating Data & `DataFrames`

As we've discussed in lectures, manipulating data can be a major component of data analysis. This week will look at some further ways to manipulate data that might be useful for you when analysing the data for your final report. 

Specifically we will:
1. recap on some useful sorting and selecting methods
2. see how we can combine two `DataFrames` together using common properties for further analysis
3. look at how we can group data for analysis (e.g. grouping LSOAs for borough-level analysis)

## Setup

As usual we will be using pandas and doing some data analysis plotting, so we need to import the relevant packages (note the aliases used for reference below):

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

We will start working with the initial LSOA data that we have been using previously:

In [None]:
my_df = pd.read_csv(
    'https://github.com/kingsgeocomp/geocomputation/blob/master/data/LSOA%20Data.csv.gz?raw=true',
    compression='gzip', low_memory=False) # low_memory=False reads the file in one pass (not in chunks), so each column's type is inferred consistently

Later we'll look at how we can add more data to this `DataFrame`, but first let's just check what columns of data we have:

In [None]:
my_df.columns

Okay, now we have our data loaded and we've reminded ourselves of what the data set contains (maybe by consulting the [metadata](https://github.com/kingsgeocomp/geocomputation/raw/master/Data/LSOA_metadata.xlsx)) we can move on. 

# Recap: Sorting and Selecting 

## Finding Rows in the Data

You should remember from Week 3 that we can find out _where_ (i.e. which LSOA) the maximum value in the data occurs using code like this:
```python
my_df[my_df.POPDEN == my_df.POPDEN.max()]
```
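
To see why this works, it can help to look at the intermediate result of the comparison. The sketch below uses a small made-up `DataFrame` (the `toy_df` name and its values are illustrative assumptions, not the real LSOA data):

```python
import pandas as pd

# Toy data: names and values are invented for illustration only
toy_df = pd.DataFrame({
    'LSOA11NM': ['Area A', 'Area B', 'Area C', 'Area D'],
    'POPDEN': [50.0, 120.5, 80.2, 120.5],
})

# Step 1: the comparison produces a boolean Series (one True/False per row)
mask = toy_df.POPDEN == toy_df.POPDEN.max()
print(mask.tolist())

# Step 2: indexing with the mask keeps only the rows where it is True
max_rows = toy_df[mask]
print(max_rows)
```

Note that more than one row can be returned if several rows share the maximum value.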

### Task:
Write some code to list all of the LSOAs where the population density is more than two standard deviations greater than the mean population density of all London LSOAs: 

Hopefully, your code returns you 198 LSOAs.

Good, now what about if we want to find the top 10 LSOAs in terms of population density, and examine how many households there are in those LSOAs? Recall that we did this last week in one line:

In [None]:
sort_df = my_df.sort_values(by='POPDEN', ascending=False).head(10)[['LSOA11NM','POPDEN','HHOLDS']]

print(sort_df)

Take a look at that line of code and check you can see how several separate steps have been combined into one.

Let's pull it apart step-by-step at the code level:

* The first step in this process is `my_df.sort_values` -- you can probably guess what this does: it sorts the data frame!
* The parameters passed to the `sort_values` method are `by`, which is the column on which to sort, and `ascending=False`, which gives us the data frame sorted in _descending_ order!
* The output of `my_df.sort_values(...)` is a _new_ data frame, which means that we can simply add `.head(10)` to get the first ten rows of the newly-sorted data frame.
* And the output of `my_df.sort_values(...).head(...)` is yet _another_ data frame, which means that we can print out the values of selected columns using the 'dictionary-like' syntax: we use the outer set of square brackets (`[...]`) to tell pandas that we want to access a subset of the top-10 data frame, and we use the inner set of square brackets (`['LSOA11NM','POPDEN','HHOLDS']`) to tell pandas which columns we want to see.

I'd say 'simples, right?' but that's obviously _not_ simple. It _is_, however, very, very _elegant_ because it's quite clear (once you get past the way that lots of methods can be chained together) and it's very succinct (we did all of that in _one_ line of code!).
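
If the chaining still feels opaque, the following sketch writes each step on its own line and then confirms that the chained version gives the same result. It uses a small made-up `DataFrame` (`toy_df` and its values are assumptions for illustration) and takes the top two rows rather than ten:

```python
import pandas as pd

# Toy data: values are invented purely for illustration
toy_df = pd.DataFrame({
    'LSOA11NM': ['Area A', 'Area B', 'Area C', 'Area D'],
    'POPDEN': [50.0, 120.5, 80.2, 95.0],
    'HHOLDS': [600, 710, 650, 580],
})

# Step 1: sort by POPDEN, largest first (returns a new data frame)
sorted_df = toy_df.sort_values(by='POPDEN', ascending=False)
# Step 2: keep the first two rows of the sorted data frame
top_df = sorted_df.head(2)
# Step 3: keep only the columns of interest
stepwise = top_df[['LSOA11NM', 'POPDEN']]

# The chained one-liner performs the same three steps
chained = toy_df.sort_values(by='POPDEN', ascending=False).head(2)[['LSOA11NM', 'POPDEN']]
print(stepwise.equals(chained))
```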

### Task:
In a single line of code create a new `df` containing information about the name, population density and number of usual residents for the seven least populated LSOAs (in terms of usual residents). Then use another line of code to print the new `df`: 

## Taking a Random Sample of Data

Of course, sometimes you don't want a particular range of data, you want a _random sample_ so that you can either 
a. get a better sense of the data, or 
b. perform some kind of test with a subsample before replicating on the full data set. 

Pandas has [got you covered](http://pandas.pydata.org/pandas-docs/version/0.18.1/indexing.html#selecting-random-samples) with a huge range of options, including sampling with replacement, sample weights, row numbers and a fraction of the data set. 

Let's look at some simple examples:

In [None]:
my_df.sample(n=5)[ ['LSOA11NM','POPDEN','USUALRES'] ] # Sample of size 5

In [None]:
my_df.sample(n=5)[ ['LSOA11NM','POPDEN','USUALRES'] ] # This will not give you the same sample

Note that even though the two lines of code above are identical, they return different (random) samples of rows. This is useful, but what if we want to give our code to someone else so that they get the same (random) sample of rows? To do this we can specify the `random_state` argument:

In [None]:
my_df.sample(n=5, random_state=2)[ ['LSOA11NM','POPDEN','HHOLDS'] ] 

By specifying the same value for `random_state` we will get the same sample: 

In [None]:
my_df.sample(n=5, random_state=2)[ ['LSOA11NM','POPDEN','HHOLDS'] ] 

And using a different value for `random_state` gives us a different sample:

In [None]:
my_df.sample(n=5, random_state=3)[ ['LSOA11NM','POPDEN','HHOLDS'] ] 
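
Rather than comparing the printed samples by eye, we can ask pandas whether two samples are identical. This is a minimal sketch on a made-up `DataFrame` (`toy_df` is an assumption for illustration), using the `equals()` method:

```python
import pandas as pd

# Toy data: 100 rows with a single made-up column
toy_df = pd.DataFrame({'value': range(100)})

# Two samples drawn with the same random_state
first = toy_df.sample(n=5, random_state=2)
second = toy_df.sample(n=5, random_state=2)

# equals() checks that the index, columns and values all match
print(first.equals(second))
```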

We can also specify the fraction of the `DataFrame` we want to sample, rather than an absolute number of observations (think about why this is useful for when we don't know what size `DataFrame` our code might be used with):

In [None]:
my_df.sample(frac=0.002)[ ['LSOA11NM','POPDEN','USUALRES'] ] # Sample a fraction of the rows (here 0.2%)

Finally, the code above has automatically been sampling rows of data, but we can also sample columns by specifying the `axis` of the `DataFrame` we want to sample: 

In [None]:
my_df.sample(n=5, axis=1).head(10)
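
Two of the other options mentioned earlier, sampling with replacement and sample weights, can be sketched as follows. The `toy_df` data are invented for illustration; `replace` and `weights` are parameters of the `sample()` method:

```python
import pandas as pd

# Toy data: values are invented purely for illustration
toy_df = pd.DataFrame({
    'LSOA11NM': ['Area A', 'Area B', 'Area C'],
    'USUALRES': [1500, 0, 1800],
})

# With replacement, the same row can be drawn more than once,
# so we can draw more rows than the data frame contains
boot = toy_df.sample(n=6, replace=True, random_state=1)
print(len(boot))

# Weighted sampling: rows with a larger USUALRES are more likely to be drawn,
# and a row with weight 0 (Area B) is never drawn
weighted = toy_df.sample(n=2, weights='USUALRES', random_state=1)
print(weighted)
```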

# Combining Data

Up until this point we have been working with a dataset of ~48 variables (columns) for the LSOAs. But what if we have additional data for LSOAs that we want to work with together with our original data, for example to look for correlations between variables? Here we will look at how to combine two datasets that have data for individual LSOAs:
1. our original data
2. data for air quality in each LSOA

Combining these data would be useful, for example, to examine relationships between air quality and socio-economic and other variables. 

In this practical we will look at how to `merge` Pandas dataframes. There are also the `join` and `concat` functions. Each of these functions is slightly different:
- `merge` enables us to [combine](http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging) two dataframes based on a column that is common between them
- `join` is used to [combine](http://pandas.pydata.org/pandas-docs/stable/merging.html#joining-on-index) two dataframes when they share a common `index` (e.g. a `DatetimeIndex` in timeseries data)
- `concat` (short for 'concatenate') [combines](http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-objects) dataframes by stacking them together, without matching rows on a common identifier. 
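
To make the differences concrete, here is a minimal sketch using two tiny made-up dataframes (the names `left` and `right`, the column names and the values are all assumptions for illustration):

```python
import pandas as pd

# Toy data: two small tables describing the same three (made-up) areas
left = pd.DataFrame({'code': ['E1', 'E2', 'E3'], 'popden': [50.0, 120.5, 80.2]})
right = pd.DataFrame({'code': ['E1', 'E2', 'E3'], 'no2': [30.1, 42.7, 35.4]})

# merge: match rows using a shared column
merged = pd.merge(left, right, on='code')

# join: match rows using the index (here we first move 'code' into the index)
joined = left.set_index('code').join(right.set_index('code'))

# concat: stack the rows of one table under the other; no row matching is attempted
stacked = pd.concat([left, right])

print(merged.shape)
print(joined.shape)
print(stacked.shape)
```

Printing `stacked` shows why `concat` is not what we want for our LSOA data: each area appears twice, with missing values filling the columns that the other table did not have.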

We'll then look at how to analyse variables in the combined `DataFrame` we produce using correlation later in the module. 

## Air Quality Data

Metadata about the air quality data are included in the [metadata](https://github.com/kingsgeocomp/geocomputation/blob/master/data/LSOA_metadata.xlsx?raw=true) file. The data themselves are hosted online and can be read using:

In [None]:
aq_df = pd.read_csv(
    'https://github.com/kingsgeocomp/geocomputation/blob/master/data/LSOA_AirQuality.csv.gz?raw=true',
    compression='gzip', low_memory=False) # The 'low memory' option means pandas doesn't guess data types

In [None]:
aq_df.head()

### Task

Familiarise yourself with the data you have just loaded in and compare it to the data we have worked with previously. To do this you might:
1. check the column names and data types of the air quality data file and compare to the metadata file (hint: use the `info()` method for `DataFrame`s - you might need to google this)  

2. calculate descriptive statistics for the air quality data 

3. compare the column names of the air quality dataset with the original LSOA dataset

4. compare the shapes of the two `DataFrames`

From your exploration of the new data and comparison with the original LSOA data you might notice a few things:
1. They have the same number of rows
2. They have different numbers of columns
3. They share one column name in common (`LSOA11CD`)

Check you can see these observations for yourself. 

## Merge

If each of two `DataFrames` has a column containing the same identifier for the observations (rows) in the data, we can use that common identifier to define how the two `DataFrames` are joined together. For example, the data we are working with are for LSOAs (distinct geographical regions); if any additional data we have are also for LSOAs, then as long as we have a common way of identifying the LSOAs in both `DataFrames` we can `merge` the `DataFrames`. 

Hopefully you saw from the task above that we have a common identifier in both the original data `my_df` and the new air quality data `aq_df`: `LSOA11CD`. The `LSOA11CD` is a unique identifier code for each LSOA. We can use this to match rows of data in `my_df` (each of which is for a particular LSOA) with the corresponding rows in `aq_df` (which are also for individual LSOAs).

With the common identifier identified, we now need to decide what type of join we want to do. Recall from this week's lecture that there are four main types of 'join':
1. left
2. right
3. outer
4. inner

We could use any of the above depending on our objectives. 
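
As a reminder of how the four types differ, the sketch below merges two tiny made-up dataframes that only partly overlap (`left`, `right`, the column names and the values are assumptions for illustration) and records which identifiers survive each type of join:

```python
import pandas as pd

# Toy data: 'E1' appears only in left, 'E4' appears only in right
left = pd.DataFrame({'code': ['E1', 'E2', 'E3'], 'popden': [50.0, 120.5, 80.2]})
right = pd.DataFrame({'code': ['E2', 'E3', 'E4'], 'no2': [42.7, 35.4, 28.9]})

codes_by_join = {}
for how in ['left', 'right', 'outer', 'inner']:
    result = pd.merge(left, right, how=how, on='code')
    # Record which identifiers are present in the merged result
    codes_by_join[how] = sorted(result['code'])
    print(how, codes_by_join[how])
```

A left join keeps every row of the left dataframe, a right join keeps every row of the right dataframe, an outer join keeps everything, and an inner join keeps only identifiers found in both.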

Here we'll do a **left join**, where the left `df` will be the original data and the right `df` will be our new air quality data. This seems appropriate so that we don't modify the original data too much (thereby potentially messing up some of our previous code):

In [None]:
#merge the two data frames 
merge_df = pd.merge(my_df, aq_df, how = 'left', on = 'LSOA11CD')

Okay, now let's check what the columns are in the new `DataFrame` we just created: 

In [None]:
print(merge_df.columns)

And let's check from a sample of the data how the rows look: 

In [None]:
merge_df.sample(n=5, random_state=3)[ ['LSOA11CD','POPDEN','HHOLDS', 'PM25mean', 'PM25min', 'PM25max', 'PM25sd'] ] 

Hopefully it looks good so far. Let's review what we did with the `pd.merge` function above:
```python
pd.merge(my_df, aq_df, how = 'left', on = 'LSOA11CD')
```

1. `my_df` is the left `df`
2. `aq_df` is the right `df` 
3. `how` defines what type of join is to be done
4. `on` is the column we want to use as the common identifier to 'join on' 

So above, each value in the `LSOA11CD` column in `aq_df` is matched with the same value in the `LSOA11CD` column in `my_df` and the rows those values are found in are combined. The figure below illustrates the process (and look back to the lecture slides and see this [nice tutorial](https://www.shanelynn.ie/merge-join-dataframes-python-pandas-index-1/)). 

![Illustration of the Pandas merge function](http://pandas.pydata.org/pandas-docs/stable/_images/merging_merge_on_key_left.png)

Check you understand how something similar to the image above has been done for our LSOA data. Remember you can read [the documentation](http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging) for more detailed explanation. 

Even though we used the pandas `merge` function here, we are doing what we called a _join_ in the lecture; the main difference between pandas `merge` and `join` is that, by default, the former matches on a common column whereas the latter matches on a common index. The `merge` function can do all four of the joins we have considered (left, right, outer and inner).   

To check if there were any missing values introduced into our new `DataFrame` we can do [a quick check](https://stackoverflow.com/a/29530601):

In [None]:
merge_df.isnull().values.any()

Hopefully you received a `False` response! If so, this is more evidence the join worked (if not you might want to check what you did above and ask for help). 
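
If a check like this ever returns `True`, two optional arguments of `pd.merge` can help track down the problem: `indicator=True` adds a `_merge` column recording where each row came from, and `validate='one_to_one'` raises an error if an identifier is repeated in either dataframe. A minimal sketch on made-up data (`left`, `right` and their values are assumptions for illustration):

```python
import pandas as pd

# Toy data: 'E3' has no matching row in the right dataframe
left = pd.DataFrame({'code': ['E1', 'E2', 'E3'], 'popden': [50.0, 120.5, 80.2]})
right = pd.DataFrame({'code': ['E1', 'E2'], 'no2': [30.1, 42.7]})

checked = pd.merge(left, right, how='left', on='code',
                   indicator=True, validate='one_to_one')

# How many rows were matched ('both') or unmatched ('left_only')?
print(checked['_merge'].value_counts())

# Rows from the left dataframe that found no partner on the right
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched)
```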

Let's save these data for later - they may be useful for your final report!

In [None]:
merge_df.to_csv("LondonLSOAData.csv", index=False)
#or
#merge_df.to_csv("LondonLSOAData.csv.gz", compression='gzip', index=False)

(If you really want to check what the join has done, you might open the file you just saved in Excel to have a look)

### Exercises:

Explore the air quality data to get an understanding of what they might show you in relation to other variables in the data set. For example:

1. Find the population densities of the LSOAs with highest maximum values for each of the four pollutants.  

2. Create a single boxplot to compare the distributions of the mean values of each of the four pollutants  

3. Create four scatter plots within a single figure (use a loop) to visualise the relationship between area within 250m of a major road and the minimum values of each pollutant _[If you find it difficult you may want to skip this exercise in practical to ensure you can work through the Grouping section while help is available]_

# Grouping Data

Often in geographical data, we have data specified for different areal units: counties, state parks, constituencies, etc. Frequently these units are hierarchical: counties are sub-units within countries or states (e.g. Yorkshire is within England), and postcode units sit within postcode areas (e.g. WC2R 2LS is within WC2R). This is particularly true for census data and so for the data we are working with; Lower Super Output Areas (LSOAs) are within London Boroughs. 

If we want to investigate differences or similarities at the borough-level, we are going to have to group the data for individual LSOAs into their respective boroughs. Then we can summarise the boroughs as a whole. 

As you can see from the [metadata](https://github.com/kingsgeocomp/geocomputation/blob/master/data/LSOA_metadata.xlsx?raw=true), the _LAD11CD_ and _LAD11NM_ columns contain Local Authority District (i.e. Borough) IDs and names. These columns are useful as they specify, for every LSOA (each of which is on a different row), an ID and the name of the borough that LSOA lies within. 

We could check the contents of these columns using the `unique()` method:

In [None]:
print(merge_df.LAD11CD.unique())
print(merge_df.LAD11NM.unique())

There are two ways we could use these columns to analyse our data at the borough level:
1. create new `DataFrames` for each individual borough
2. tell pandas to group the data using the values in the borough column 

The first approach might be useful if we want to examine one or few particular boroughs in detail. We could create these new `DataFrames` using the selection methods we have seen previously (but also [others you could learn about](http://pandas.pydata.org/pandas-docs/stable/indexing.html)). However, if we want to work with data for all London boroughs, this method would not be particularly easy to work with. 

So for the second approach, the pandas library has another data structure known as `DataFrameGroupBy` which is useful in this situation (see [documentation here](http://pandas.pydata.org/pandas-docs/stable/groupby.html)). We'll examine this approach in more detail now. 

### `DataFrameGroupBy`

The `DataFrameGroupBy` data structure is created using the `groupby` method. To do this, grouping by borough:

In [None]:
boroughs = merge_df.groupby('LAD11NM')

The `boroughs` `DataFrameGroupBy` object is not an ordinary `DataFrame`, but it offers many of the same methods, which operate on the groups we have specified (in this case by borough name). For example, the `head()` method returns the first five rows of each group, so its output looks similar to a normal `DataFrame`:

In [None]:
boroughs.head()

But when we try to get the shape of the object we find it's slightly different from a normal `DataFrame` (you should get an error): 

In [None]:
boroughs.shape[0]

So we can’t use `shape` to find out how many elements are in the boroughs `DataFrameGroupBy` object, but we can use our old favourite function `len()` (which works pretty much anywhere!). Compare the output for the next two lines of code:

In [None]:
len(boroughs) 

In [None]:
len(merge_df)

The length of `boroughs` is the number of groups in the `DataFrameGroupBy` object, whereas the length of `merge_df` is the number of rows in the `DataFrame` object. Check you understand the difference! We can tell from this that there are 33 boroughs (groups) and 4835 LSOAs (rows).

The difference between `merge_df` and `boroughs` also results in different output for other methods. Compare the output of the following:

In [None]:
boroughs["LAD11NM"].count()

In [None]:
merge_df["LAD11NM"].count()

See how the `count` method for the `DataFrameGroupBy` object gives the count of LSOAs within each borough (group) whereas the `count` method for the `DataFrame` object simply gives the count of the total number of LSOAs (rows). 

Note, the following two lines of code would do exactly the same as the last two input cells, but using dot notation:
```python
boroughs.LAD11NM.count()
merge_df.LAD11NM.count()
```
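
One subtlety worth knowing: `count()` only counts non-missing values, so for a column with gaps it can differ from the number of rows in each group. The sketch below, on made-up data (`toy_df` and its values are assumptions for illustration), contrasts `count()` with `size()` and the `ngroups` attribute:

```python
import numpy as np
import pandas as pd

# Toy data: one POPDEN value is deliberately missing
toy_df = pd.DataFrame({
    'LAD11NM': ['Borough X', 'Borough X', 'Borough Y', 'Borough Y', 'Borough Y'],
    'POPDEN': [50.0, np.nan, 80.2, 95.0, 60.3],
})
groups = toy_df.groupby('LAD11NM')

# Number of groups (the same value that len(groups) returns)
n_groups = groups.ngroups
# Number of rows in each group, whether or not values are missing
sizes = groups.size()
# Number of non-missing POPDEN values in each group
counts = groups['POPDEN'].count()

print(n_groups)
print(sizes)
print(counts)
```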

There are other methods we can use on `DataFrameGroupBy`, for example `get_group()` gets the data (for LSOAs) for just one of the groups (boroughs):

In [None]:
boroughs.get_group("City of London")  

Note that to access the data for this group, we pass a value from the column we used to define the groups using `groupby()` previously. As we used _LAD11NM_ to specify our groups above, here we were able to type the name we wanted (_"City of London"_). But if we had used _LAD11CD_ to specify the groups, we would have had to pass _E09000001_.
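
The point about group keys can be sketched with a small made-up `DataFrame` (the borough names and codes below are invented for illustration): grouping by name or by code retrieves the same rows, but the key passed to `get_group()` must match the column used in `groupby()`.

```python
import pandas as pd

# Toy data: codes and names are invented purely for illustration
toy_df = pd.DataFrame({
    'LAD11CD': ['X01', 'X01', 'Y02'],
    'LAD11NM': ['Borough X', 'Borough X', 'Borough Y'],
    'POPDEN': [50.0, 120.5, 80.2],
})

# Grouped by name, so the key is a name
by_name = toy_df.groupby('LAD11NM').get_group('Borough X')
# Grouped by code, so the key is a code
by_code = toy_df.groupby('LAD11CD').get_group('X01')

print(by_name.equals(by_code))

# The available keys can be listed from the groups attribute
print(list(toy_df.groupby('LAD11NM').groups.keys()))
```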

Using the `DataFrameGroupBy` object also allows us to describe the data by group (rather than for all of the LSOAs as we did before). For example, to find the mean values for the columns by borough we can use the `aggregate()` method: 

In [None]:
bMeans = boroughs.aggregate(np.mean)
print(bMeans)

Note how the `aggregate()` method makes a call to the `numpy` function `mean`; this is why we needed to `import numpy as np` in the setup section at the start of the notebook. 
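
A caveat: depending on your pandas version, aggregating with `np.mean` over _all_ columns may produce a warning or an error, because text columns (such as LSOA names) have no mean. If that happens, one option is to restrict the calculation to numeric columns. The sketch below shows two ways of doing this on made-up data (`toy_df` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy data: values are invented purely for illustration
toy_df = pd.DataFrame({
    'LAD11NM': ['Borough X', 'Borough X', 'Borough Y'],
    'LSOA11NM': ['Area A', 'Area B', 'Area C'],   # text column: has no mean
    'POPDEN': [50.0, 100.0, 80.0],
    'HHOLDS': [600, 700, 650],
})
groups = toy_df.groupby('LAD11NM')

# Option 1: ask pandas to skip non-numeric columns
means = groups.mean(numeric_only=True)

# Option 2: select the columns of interest before aggregating
selected_means = groups[['POPDEN', 'HHOLDS']].mean()

print(means)
```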

The `aggregate` method returns a `DataFrame`. Check this by:
1. printing the type of object of `bMeans` 
2. printing the `POPDEN` and `HHOLDS` columns of the new `DataFrame`

In [None]:
type(bMeans)

In [None]:
print(bMeans[['POPDEN','HHOLDS']])

Check you understand what has been produced here; `bMeans` contains the mean (average) of all columns in our original dataset but aggregated (grouped) by borough.

### Asking questions about boroughs

Let’s see how this all might be useful for answering a geographical question. Say we want to calculate what proportion of the population of the Borough of Harrow identifies as 'White' ethnicity:

In [None]:
boroughs = merge_df.groupby('LAD11NM')  #as above  
bSums = boroughs.aggregate(np.sum)      #sum of columns grouped by borough

harrow_sumW = bSums.White.loc["Harrow"]          #note: equivalent using dot notation is harrow_sumW = bSums.White.Harrow   
harrow_sumRes = bSums.USUALRES.loc["Harrow"]     #note: equivalent using dot notation is harrow_sumRes = bSums.USUALRES.Harrow

harrow_propW =  float(harrow_sumW) / float(harrow_sumRes)                 #convert to float when calculating proportion 
print("The proportion of Harrow that is White ethnicity is: {0:.3f}".format(harrow_propW))   #print nicely

Run the code above and check that you find that the proportion is 0.422. 
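
The same calculation can be done for every borough at once, because dividing one column by another works row by row. A minimal sketch on made-up data (`toy_df`, `bSums_toy` and the values are assumptions for illustration, not the real census counts):

```python
import pandas as pd

# Toy data: counts are invented purely for illustration
toy_df = pd.DataFrame({
    'LAD11NM': ['Borough X', 'Borough X', 'Borough Y'],
    'White': [400, 600, 300],
    'USUALRES': [1000, 1000, 1200],
})

# Sum the two columns of interest within each borough
bSums_toy = toy_df.groupby('LAD11NM')[['White', 'USUALRES']].sum()

# Dividing two columns gives one proportion per borough
prop_white = bSums_toy['White'] / bSums_toy['USUALRES']
print(prop_white.round(3))
```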

### Iterating over `DataFrameGroupBy`

Finally, a short note to highlight that [iterating over groups](http://pandas.pydata.org/pandas-docs/stable/groupby.html#iterating-through-groups) in a `DataFrameGroupBy` object is much the same as looping over many other objects in Python. For example, to iterate over all boroughs (groups) printing out the total population of each: 

In [None]:
boroughs = merge_df.groupby('LAD11NM')

for key, value in boroughs:
    popn = value.USUALRES.sum()
    print("{0:8.0f} people in {1}".format(popn,key))

Have a think about what the code above does:
1. iterating over the groups returns a tuple composed of `key` and `value`
2. `value` allows us to get to the actual data in each group (borough)
3. we can use `key` to get the label of each group (in this case the values of `LAD11NM` used to create the `GroupBy` object)

Note, the above is just an example to show the structure of how to loop over `GroupBy` objects and we could have done much the same (although without the nice string formatting) using:

```python
print(boroughs.USUALRES.sum())
```

### Exercises

Similar to the exercises above for pollutants, but this time for borough-level data:

1. Find the total number of usual residents of the boroughs with highest maximum values for each of the four pollutants.  

2. Create a barplot to compare the borough-level means of mean values of each of the four pollutants  

3. Create four scatter plots within a single figure (use a loop) to visualise the relationship between total area within 250m of a major road within a borough and the minimum values of each pollutant

# Summary 

In this practical we have:
1. had a recap of some sorting and selecting
2. seen how to take a random sample of data
3. introduced ourselves to the air quality data and combined it with our other data
4. learned about the `DataFrameGroupBy` data structure. 

You now have the 'full' data set (combining the original data with the air quality data) that you can use for your final report. So start exploring!

If you want to join your own LSOA data for analysis in your final report, please discuss with James before doing so. 

## Credits!

#### Contributors:
The following individuals have contributed to these teaching materials: Jon Reades (jonathan.reades@kcl.ac.uk), James Millington (james.millington@kcl.ac.uk)

#### License
These teaching materials are licensed under a mix of [The MIT License](https://opensource.org/licenses/mit-license.php) and the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/).

#### Acknowledgements:
Supported by the [Royal Geographical Society](https://www.rgs.org/HomePage.htm) (with the Institute of British Geographers) with a Ray Y Gildea Jr Award.

#### Potential Dependencies:
This notebook may depend on the following libraries: pandas, numpy, matplotlib, seaborn