<center><h1>7SSG2059 Geocomputation 2017/18</h1></center>

<h1><center>Practical 6: Data Manipulation (Merging and Joining in Pandas)</center></h1>

<p><center><i>James Millington, 31 October 2017</i></center></p>



# Manipulating Data & Data Frames

Intro here

In [None]:
import pandas as pd


Load data

Quick recap of selecting (week 3) and sorting (week 4)  

## Finding Rows in the Data

You should remember from Week 5 that we can find out _where_ the maximum value in the data occurs using this code:
```python
df[df.Group1Pct == df.Group1Pct.max()]
```

Can you figure out how to list all of the LSOAs where the proportion of Group 1 residents is greater than 50%? 

Now, let's find out what are the top 10 areas in terms of the concentration of Group 1 residents and compare to the top 10 areas in terms of _raw_ counts of Group 1 residents... This is a nice illustration of how you can _chain_ together a whole series of methods to do some pretty cool stuff:

In [None]:
df.sort_values(by='Group1Pct', ascending=False).head(10)[['GeoLabel','Group1','Group1Pct']]

In [None]:
df.sort_values(by='Group1', ascending=False).head(10)[['GeoLabel','Group1','Group1Pct']]

Just so that you understand what we just did with this:
1. Take the data frame `df`;
2. Sort it by descending order;
3. Take the first ten values;
4. Print out the columns specified by the list.

Let's pull it apart step-by-step at the code level:

* The first step in this process is `df.sort_values` -- you can probably guess what this does: it sorts the data frame!
* The parameters passed to the `sort_values` method are `by`, which is the column on which to sort, and `ascending=False`, which gives us the data frame sorted in _descending_ order!
* The output of `df.sort_values(...)` is a _new_ data frame, which means that we can simply add `.head(10)` to get the first ten rows of the newly-sorted data frame.
* And the output of `df.sort_values(...).head(...)` is yet _another_ data frame, which means that we can print out the values of selected columns using the 'dictionary-like' syntax: we use the outer set of square brackets (`[...]`) to tell pandas that we want to access a subset of the top-10 data frame, and we use the inner set of square brackets (`['GeoLabel','Group1','Group1Pct']`) to tell pandas which columns we want to see.
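
To make the chaining concrete, here is a minimal sketch that performs the same steps one at a time. It assumes a small made-up data frame (`toy`, with invented values) standing in for `df`, and takes the top two rows rather than the top ten:

```python
import pandas as pd

# Toy stand-in for the NS-SeC data frame (values are made up for illustration)
toy = pd.DataFrame({
    'GeoLabel':  ['A 001', 'B 001', 'C 001', 'D 001'],
    'Group1':    [120, 300, 80, 150],
    'Group1Pct': [12.0, 10.0, 40.0, 25.0],
})

# The same chain, one step per line
by_pct = toy.sort_values(by='Group1Pct', ascending=False)  # a new, sorted data frame
top2 = by_pct.head(2)                                      # first two rows of the sorted frame
result = top2[['GeoLabel', 'Group1', 'Group1Pct']]         # keep only the listed columns

# The chained one-liner should give the same answer
chained = toy.sort_values(by='Group1Pct', ascending=False).head(2)[['GeoLabel', 'Group1', 'Group1Pct']]
print(result.equals(chained))
```

Each intermediate name (`by_pct`, `top2`) holds a complete data frame, which is why the next method can be attached directly to it.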

I'd say 'simples, right?' but that's obviously _not_ simple. It _is_, however, very, very _elegant_ because it's quite clear (once you get past the way that lots of methods can be chained together) and it's very succinct (we did all of that in _one_ line of code!).

# Taking a Cut of the Data

For some types of plots we can be overwhelmed by the data -- trying to show the distribution of Group 1 people as grouped by, say, borough or Local Authority would be tricky since, not only are there a lot of them, but we also don't even _have_ a LA column to use!

To take a 'cut' of the data based on LAs, we need some way to get to grips with what's happening in the `GeoLabel` series as that's pretty obviously the only place we're going to get location data. 

## Taking a Random Sample

Rather than trying to step through the whole data frame, wouldn't it be handy to take a random sample first? As usual, [Stack Overflow](http://stackoverflow.com/questions/15923826/random-row-selection-in-pandas-dataframe) is your friend, but in this case it turns out that pandas has improved since that question was asked (which is why it always pays to check the latest documentation) and now supports sampling directly:

## Sampling

Of course, sometimes you don't want a particular range of data, you want a _random sample_ so that you can either a) get a better sense of the data, or b) perform some kind of test with a subsample before replicating on the full data set. Pandas has [got you covered](http://pandas.pydata.org/pandas-docs/version/0.18.1/indexing.html#selecting-random-samples) with a huge range of options, including sampling with replacement, sample weights, row numbers and a fraction of the data set. 
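
As a rough sketch of those options, the block below uses a small made-up data frame (`toy`) rather than the NS-SeC data. The `random_state` argument is not used elsewhere in this practical; it is included here on the assumption that you may sometimes want a repeatable sample:

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({'GeoLabel': ['A', 'B', 'C', 'D', 'E'],
                    'Total': [10, 20, 30, 40, 50]})

# Fixing random_state makes the 'random' sample repeatable
s1 = toy.sample(n=3, random_state=42)
s2 = toy.sample(n=3, random_state=42)
print(s1.equals(s2))

# Sampling with replacement allows more rows than the frame contains
boot = toy.sample(n=8, replace=True, random_state=1)

# Rows with a larger 'Total' are more likely to be drawn
weighted = toy.sample(n=2, weights='Total', random_state=1)

# A fraction of the rows (here 40% of 5 rows)
part = toy.sample(frac=0.4, random_state=1)
print(len(boot), len(weighted), len(part))
```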

Let's look at some simple examples with the full NS-SeC data set:

In [None]:
df.sample(n=5)[ ['CDU','GeoCode','GeoLabel'] ] # Sample of size 5

In [None]:
df.sample(n=5)[ ['CDU','GeoCode','GeoLabel'] ] # This will not give you the same sample

In [None]:
df.sample(frac=0.00025)[ ['CDU','GeoCode','GeoLabel'] ] # Sample a fraction of the rows

In [None]:
# Now we can take a random sample
# of 20 rows from df (or any other
# number of rows)
sdf = df.sample(n=20)

# And let's look at the geography
sdf.GeoLabel

If you run the random sample several times, you'll see that there is something of a pattern here: there is something that is obviously the name of a Local Authority (LA) and it's often followed by some kind of identifier (because LSOAs are smaller than LAs). There are a couple of ways that we can approach this:

1. We could try to match on the name of the LA in the GeoLabel (*e.g.* find the GeoLabels starting with 'Hackney')
2. We could try to find a way to strip off the identifier so that what we'd be left with was the name of LA.

### Matching on the Start of a Word

This is easier in the short-term because we can just say "find the Hackney GeoLabels" but it's less flexible in the long-term because we can't actually use the GeoLabel column as a way to group our results (because each GeoLabel is _still_ unique, so grouping will group by LSOA).

But let's take a look at how that works anyway:

In [None]:
df.loc[df.GeoLabel.str.startswith('Hackney'),['CDU','GeoCode','GeoLabel','Total']].head(10)

You'll see that that gave us the Hackney LSOAs, but without modifying any of the data. This is also the first time you've seen the `<data frame>.loc` notation, which allows us to combine row and column selection in one step using labels or boolean conditions. (Older pandas code used `<data frame>.ix` for this, but `.ix` was deprecated in pandas 0.20 and removed in pandas 1.0.) In other words, you can use this to [select _any_ kind of subset you like](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html).

Just to make it really obvious what I did:
```python
<data frame>.loc[ <row selection criteria>, <column selection criteria> ].head(10)
```
So this is:
1. Select rows where `<data frame>.<series name>.str.startswith(<search string>)` (treats the LSOA GeoLabel as a string and then searches for strings that start with...)
2. Select columns in the list `['CDU','GeoCode','GeoLabel','Total']`
3. Return the first 10 rows using `head(10)`

You could _also_ select columns by integer position, as you would with any normal list, but for that you need `<data frame>.iloc` rather than `.loc`: e.g. `df.iloc[:, 2:5]`.
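
The second approach listed earlier (stripping off the identifier to leave the name of the LA) could be sketched as follows. This assumes that every GeoLabel has the form `<LA name> <identifier>` with a single space before the identifier; the labels below are made up, so check that assumption against the real data before relying on it:

```python
import pandas as pd

# Made-up labels, assuming the form '<LA name> <identifier>'
toy = pd.DataFrame({'GeoLabel': ['Hackney 001A', 'Hackney 002B',
                                 'Barking and Dagenham 001A',
                                 'Kensington and Chelsea 010C']})

# Split once from the right on a space and keep the left-hand part
toy['LA'] = toy.GeoLabel.str.rsplit(' ', n=1).str[0]

# Store as a categorical so that the LA names can be used for grouping
toy['LA'] = toy['LA'].astype('category')
print(toy.LA.value_counts())
```

Splitting from the right (rather than the left) matters because some LA names contain spaces themselves.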

### Selecting by Match

Now we can take a 'cut' of the data by selecting only some London-based boroughs for further analysis... we’re going to arbitrarily select K&C, Hackney and Barking because I know they’re quite different boroughs, but you could also use the data to make this selection, right? For instance, I could ask pandas to help me pick the three boroughs that are the furthest apart in terms of their Group 1 means...

Anyway, here’s the code to select a cut:
```python
# sdf = subsetted data frame

# Select where the LA value 'is in' one of our pre-defined list
# (.copy() gives an independent data frame, so that changing sdf
# below doesn't trigger a SettingWithCopyWarning)
sdf = df.loc[df.LA.isin(['Kensington and Chelsea','Hackney','Barking and Dagenham'])].copy()

# Remove the remaining unused categories
sdf.LA = sdf.LA.cat.remove_unused_categories()

# And a simple check to see how many categorical values are left
print("sdf now contains: {0} values".format(sdf.LA.nunique()))
```
You’ll notice two unusual bits of code in there that need some explanation:
* To 'select multiple' we need to write: `<data frame>.<series name>.isin([...])`, so we can't just write `<data frame>.<series name>==[...]` unfortunately.
* We also have this line: `<data frame>.<column name> = <data frame>.<column name>.cat.remove_unused_categories()`. This should be fairly self-explanatory, but it’s because by default pandas doesn’t update the list of valid categories (i.e. Local Authorities) just because we filtered out boroughs that weren’t of interest. We therefore need to update the Series so that Seaborn doesn’t include a bunch of empty categories when we make our plots.
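
To see what those two points mean in practice, here is a small sketch on made-up data (the `toy` frame and its values are invented for illustration). It shows that the filtered-out category is still listed until it is explicitly removed:

```python
import pandas as pd

# Made-up data with a categorical LA column
toy = pd.DataFrame({'LA': pd.Categorical(['Hackney', 'Camden', 'Harrow', 'Hackney']),
                    'Total': [100, 200, 300, 400]})

# Keep only the rows whose LA is in the list; .copy() gives an independent frame
cut = toy.loc[toy.LA.isin(['Hackney', 'Harrow'])].copy()

# 'Camden' has no rows left, but is still a valid category at this point
print(list(cut.LA.cat.categories))

# After removing unused categories only the selected LAs remain
cut['LA'] = cut.LA.cat.remove_unused_categories()
print(list(cut.LA.cat.categories))
```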

## Grouping Data

Now that we have our Borough column we can tell pandas to group the data using the values in that column (alternatively, we could do some filtering/selecting on the column). The pandas library has another data structure known as `DataFrameGroupBy` which is useful in this situation (read more in the pandas documentation). We can create one of these data structures to group our boroughs using the `groupby` method:

In [None]:
boroughs = df.groupby('Borough')

The `boroughs` `DataFrameGroupBy` object behaves in some ways like a DataFrame, but has additional methods available based on the groups we have specified (in this case the borough). For example, when we use the `head` method it looks similar to a normal DataFrame:

In [None]:
boroughs.head()

But when we try to get the shape of the object we find it's slightly different from a normal DataFrame (you should get an error): 

In [None]:
boroughs.shape[0]

So we can’t use `shape` to find out how many elements are in the `boroughs` `DataFrameGroupBy` object, but we can use our old favourite function `len` (which works pretty much anywhere!). Compare the output for the next two lines of code:

In [None]:
len(boroughs) 

In [None]:
len(df)

The length of `boroughs` is the number of groups in the `DataFrameGroupBy` object, whereas the length of `df` is the number of rows in the DataFrame object. Check you understand the difference! We can tell from this that there are 33 boroughs (groups) and 4835 LSOAs (rows).

The difference between `df` and `boroughs` also results in different output for other methods. Compare the output of the following:

In [None]:
print(boroughs["GEO_LABEL"].count())   #count of LSOAs in each borough (group)

#or

print(df["GEO_LABEL"].count())         #count of all LSOAs (rows)

See how the `count` method for the `DataFrameGroupBy` object gives the count of LSOAs within each borough (group) whereas the `count` method for the DataFrame object simply gives the count of the total number of LSOAs (rows). Note, the following two lines of input code do exactly the same as the last two but with slightly different notation (known as dot notation): 

In [None]:
print(boroughs.GEO_LABEL.count())
print(df.GEO_LABEL.count())

Dot notation looks a bit nicer but isn’t always as flexible as using `[]`.

There are other methods we can use on `DataFrameGroupBy`, for example `get_group` gets the data (for LSOAs) for just one of the groups (boroughs):

In [None]:
boroughs.get_group("City of London")  

Using the `DataFrameGroupBy` object also allows us to describe the data by group (rather than for all of the LSOAs as we did before). For example, to find the mean values for the columns by borough we can use the `aggregate` method. 

In [None]:
import numpy as np                     #needed for np.mean below
bMeans = boroughs.aggregate(np.mean) 
print(bMeans)

Note how the `aggregate` method makes a call to the numpy function `mean` (so numpy must have been imported with `import numpy as np`). The `aggregate` method returns a DataFrame; check this by using `type(bMeans)`. Check you understand what has been produced here: the mean of all columns in our original NS-SeC dataset, aggregated (grouped) by borough. (If we had added our Boroughs column to Excel we would use a Pivot Table to get similar output.) 
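
The grouping ideas above can be tried out on a tiny made-up data frame where the answers are easy to check by eye. Note that this sketch swaps in the group-by `mean()` method on explicitly selected numeric columns, rather than `aggregate(np.mean)`, to avoid trying to average the text column; the column values are invented:

```python
import pandas as pd

# Made-up data: two LSOAs in Harrow, one in Camden
toy = pd.DataFrame({'Borough': ['Harrow', 'Harrow', 'Camden'],
                    'GEO_LABEL': ['Harrow 001A', 'Harrow 001B', 'Camden 001A'],
                    'Total': [100, 300, 200],
                    'Group2': [20, 90, 50]})

groups = toy.groupby('Borough')
print(len(groups), len(toy))          # number of groups, then number of rows
print(groups['GEO_LABEL'].count())    # number of rows in each group

# Mean of the numeric columns for each group; the result is an ordinary DataFrame
means = groups[['Total', 'Group2']].mean()
print(type(means))
print(means)
```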

Let’s see how this all might be useful for answering a geographical question. Say we want to calculate what proportion of the population of the Borough of Harrow is ‘lower managerial’ (Group 2). The code to do this is shown below:

In [None]:
#What proportion of population of Harrow is lower managerial?  (code in Figure 2 in practical handout)
boroughs = df.groupby('Borough')                                     #group LSOAs by borough, as before
bSums = boroughs.aggregate(np.sum)                                   #sum of columns grouped by borough
harrow_sumG2 = bSums.Group2.loc["Harrow"]                            #sum of Group 2 for LSOAs in Harrow
harrow_sumTot = bSums.Total.loc["Harrow"]                            #sum of Total for LSOAs in Harrow
#equivalent sums via get_group: boroughs.get_group("Harrow").Group2.sum() and boroughs.get_group("Harrow").Total.sum()

#convert to float when making calculation
harrow_propG2 =  float(harrow_sumG2) / float(harrow_sumTot)          #calculate proportion 
print("The proportion of Harrow in Group2 is: " + str(harrow_propG2))   #print nicely

Run the code above and check that you find that the proportion is 0.236. Also check you understand how (and why) `harrow_sumG2` and `harrow_sumTot` were created – they were created by indexing the `bSums` DataFrame created by `aggregate`. What happens if you do not force these to be float when calculating the proportion?
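
If it helps to check the logic, here is the same calculation as a sketch on made-up numbers small enough to verify by hand (these are not the real Harrow figures). It also shows that dividing one column of the aggregated frame by another gives the proportion for every borough at once:

```python
import pandas as pd

# Made-up data: two LSOAs in Harrow, one in Camden
toy = pd.DataFrame({'Borough': ['Harrow', 'Harrow', 'Camden'],
                    'Total': [100, 300, 200],
                    'Group2': [20, 90, 50]})

sums = toy.groupby('Borough')[['Total', 'Group2']].sum()

# Index the aggregated frame by group label to get one number per column
harrow_g2 = sums.loc['Harrow', 'Group2']     # 20 + 90
harrow_tot = sums.loc['Harrow', 'Total']     # 100 + 300
prop = float(harrow_g2) / float(harrow_tot)
print("Proportion in Group 2: {0:.3f}".format(prop))

# The same calculation for every borough at once
props = sums['Group2'] / sums['Total']
print(props)
```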

# Merging and Joining Data

Up until this point we have been working with individual datasets. But what if we have data in two datasets that we want to work with together, for example to look for correlations between variables? This will be useful, for example, to examine relationships between weather and air quality variables or between socio-economic classifications and amenity values in neighbourhoods. 

In this practical we will look at how to `merge` and `join` Pandas dataframes. There's also a `concat` (concatenate) function. Each of these functions is slightly different:
- `merge` enables us to [combine](http://pandas.pydata.org/pandas-docs/stable/merging.html) two dataframes based on a column that is common between them
- `join` is used to [combine](http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging) two dataframes when they share a common index (e.g. a `DatetimeIndex`)
- `concat` [combines](http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-objects) dataframes regardless of common attributes. 
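
A quick way to get a feel for the difference between the three is to try them on two tiny made-up data frames (`left` and `right` below are invented for illustration). In pandas the concatenation function is called `pd.concat`:

```python
import pandas as pd

# Two made-up frames that share a 'key' column; 'c' and 'd' each appear in only one
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'y': [10, 20, 40]})

# merge: match rows on a shared column (by default only keys found in both frames are kept)
merged = pd.merge(left, right, on='key')

# join: match rows on the index (by default every row of the left frame is kept)
joined = left.set_index('key').join(right.set_index('key'))

# concat: stack the frames without matching anything
stacked = pd.concat([left, right])

print(merged.shape, joined.shape, stacked.shape)
```

Comparing the three shapes (and looking for missing values in `joined` and `stacked`) is a good check that you understand what each function did.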

Here we will look at how to use `merge` to combine the NS-SeC data with additional data for the LSOAs into a single dataframe, and we'll see how `join` is useful to combine weather and air quality data for the same measurement times into a single dataframe. We'll then look at how to analyse these combined dataframes using correlation and regression next week.  

## Merge

If we have a column in each of two DataFrames that contains the same identifier for the remaining data, we can use the common identifier column to define how the two DataFrames are joined together. For example, the NS-SeC data are for LSOAs (distinct geographical regions) - if any additional data we have is also for LSOAs, as long as we have a common way of identifying the LSOAs in both DataFrames we can merge the DataFrames. 

### NS-SeC and Amenity Values

The additional data you can use in conjunction with the NS-SeC data are found in `LSOA_ValuesData_London.csv` on KEATS. There are a variety of additional factors that you are free to explore, and you can read about them in the `AdditionalDataOverview.pdf` document also on KEATS. Smith (2010) used similar data in their study which will also likely help you to think about possible analyses you might make for your final report (e.g. between house prices and socio-economic indicators of LSOAs). 

These data are for housing and other amenity values for LSOAs in London. Consequently, we'll also use only NS-SeC data for London from now on - LSOA NS-SeC data for London only can be found in `Data_NSSHRP_UNIT_URESPOP_London.csv` on KEATS.

The code below loads the two data files for London LSOAs into memory as pandas DataFrames, tidies up their column names and drops rows with missing data.

In [None]:
#read NS-SeC data
nsCN = ["CDU_ID","GEO_CODE","GEO_LABEL","F2084","F2085","F2094","F2102","F2107","F2114","F2119","F2127","F2133","F2136"]  
nsDF = pd.read_csv('Data_NSSHRP_UNIT_URESPOP_London.csv', header=0, skiprows=[1], usecols=nsCN)   #read csv with headers, skipping notes row and no data column 15
nsDF.columns = ["CDU_ID","GEO_CODE","GEO_LABEL","Total","Group1","Group2","Group3","Group4","Group5","Group6","Group7","Group8","NC"]  
nsDF = nsDF.dropna(axis = 0)  #drop rows with missing data

#read Additional Values Data
valCN = ["lsoa11cd","median_price","avg_distance_to_station","positive_area","moderate_area","negative_area"]
valDF = pd.read_csv('LSOA_ValuesData_London.csv', header=0, usecols = valCN)  
valDF = valDF.dropna(axis = 0)  #drop rows with missing data

If we check these data we have just read into memory, we can see that the column in `valDF` named `lsoa11cd` uses the same labels for LSOAs as the `GEO_CODE` column in `nsDF`. Very handy!  

In [None]:
#check what the common columns are
print(nsDF.head())    #print so that both outputs are shown, not just the last
print(valDF.head())

By renaming `lsoa11cd` to `GEO_CODE` we can use it with the Pandas `merge` function:

In [None]:
valDF.columns = ["GEO_CODE","MedPrice","MeanStationDist", "PosArea", "ModArea", "NegArea"]  #rename to 'GEO_CODE'!

The `merge` of `nsDF` and `valDF` is then done ‘on’ the `GEO_CODE` column found in each DataFrame. 

In [None]:
#merge the two data frames 
nsvalDF = pd.merge(nsDF, valDF, on = 'GEO_CODE')

Each value in the `GEO_CODE` column in `nsDF` is matched with the same value in the `GEO_CODE` column in `valDF` and the rows those values are found in are combined. The figure below illustrates the process (combining `left` and `right` on `key`). Read more about merge [here](http://pandas.pydata.org/pandas-docs/stable/merging.html). 

![Illustration of the Pandas merge function](http://pandas.pydata.org/pandas-docs/stable/_images/merging_merge_on_key.png)
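
As an aside, renaming is not the only option. The sketch below, which uses made-up LSOA codes and prices rather than the real files, assumes you would rather leave the column names alone and tell `merge` which column to use on each side with `left_on` and `right_on`. It also uses `how='outer'` with `indicator=True`, which is one way to see which rows failed to find a match:

```python
import pandas as pd

# Made-up stand-ins for nsDF and valDF (codes and values are invented)
ns = pd.DataFrame({'GEO_CODE': ['E01', 'E02', 'E03'], 'Total': [100, 200, 300]})
val = pd.DataFrame({'lsoa11cd': ['E01', 'E03', 'E04'],
                    'median_price': [250000, 400000, 310000]})

# Merge on differently named key columns without renaming first
inner = pd.merge(ns, val, left_on='GEO_CODE', right_on='lsoa11cd')

# how='outer' keeps unmatched rows; indicator=True adds a '_merge' column
# recording whether each row came from the left frame, the right frame or both
outer = pd.merge(ns, val, left_on='GEO_CODE', right_on='lsoa11cd',
                 how='outer', indicator=True)
print(outer['_merge'].value_counts())
```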

Check you understand how something similar has been done for our LSOA data, combining `nsDF` and `valDF` on `GEO_CODE`:

In [None]:
#check output
print(nsvalDF.head())
nsvalDF.info()

Let's save these data for later - they may be useful for your final report!

In [None]:
nsvalDF.to_csv("LondonLSOAData.csv")
nsvalDF.to_pickle("LondonLSOAData.pkl") 

## Join

The `merge` function uses a common column (Series) in two dataframes to combine them. If the _index_ of two dataframes is common we could also use `merge` to combine on the index. However, we would need to pass more arguments to the `merge` function, and another function called `join` has been designed specifically to combine on dataframe indexes. 
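
To illustrate the extra arguments that `merge` would need, here is a sketch using two made-up time-indexed data frames (the column names `NO2` and `Temp` and all the values are invented). One thing to watch is that the two functions have different defaults: `merge` keeps only matching rows, whereas `join` keeps every row of the left-hand data frame:

```python
import pandas as pd

# Made-up hourly data; the two indexes overlap for two of the three times
times_a = pd.to_datetime(['2016-01-01 00:00', '2016-01-01 01:00', '2016-01-01 02:00'])
times_b = pd.to_datetime(['2016-01-01 01:00', '2016-01-01 02:00', '2016-01-01 03:00'])
aq = pd.DataFrame({'NO2': [40.0, 42.5, 38.0]}, index=times_a)
met = pd.DataFrame({'Temp': [5.0, 4.5, 4.0]}, index=times_b)

# merge needs to be told to use the index on both sides
merged = pd.merge(aq, met, left_index=True, right_index=True)

# join uses the index by default; how='inner' mimics merge's default behaviour
joined = aq.join(met, how='inner')
print(merged.equals(joined))

# join's own default is how='left': every row of aq is kept,
# with missing values where met has no matching time
left_joined = aq.join(met)
print(left_joined)
```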

Here, we'll join some air quality time-series data to our weather time-series data using a common `DatetimeIndex`. 

First, so we can join it later, we'll load (and check) the weather data from our previous data manipulation (week 7):

In [None]:
metDF = pd.read_pickle("CleanedHeathrowData2016.pkl")
metDF.info()           #info() prints its own summary
print(metDF.tail())

### Air Quality Data

The additional data we'll use with the Heathrow weather data are air quality data that have been downloaded from the Air Quality England [website](http://www.airqualityengland.co.uk/) (AQE 2016) for the [Hounslow Hatton Cross site](http://www.airqualityengland.co.uk/site/latest?site_id=HS7) (site HS7). This site was chosen as it is near Heathrow Airport. 

Air pollution is an important aspect of the ongoing argument about the construction of the third runway at Heathrow (e.g. GLA 2012). In particular, although Nitrogen Dioxide (NO2) concentrations around Heathrow are lower than in the centre of London, they are still often above recommended levels (e.g. Heathrow 2012). By looking at relationships between weather and air quality we may begin to better understand the drivers of pollution.

However, as the air quality data have also been automatically collected, we'll need to do some cleaning and manipulation of those data before we can join them with the weather data. 

### Cleaning Air Quality Data

When reading the data (in the next code block) we will read only the first 10 columns of the data to a DataFrame named `aqDF`, accounting for the need to skip lines. Also note there is a footer in the data so we use the `skipfooter` argument, but this also means we need to add the `engine` argument (read more about this in the `read_csv` [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) online).
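
As a rough illustration of those `read_csv` arguments, the sketch below reads from a made-up block of text rather than the real air quality file, so the layout (two lines of notes, one footer line, three columns kept) is purely an assumption for the example and will not match the real data:

```python
import io
import pandas as pd

# Made-up file contents: two lines of notes, a header, data rows and a footer line
raw = u"""Site: example
Downloaded: example
Date,Time,NO2,Status
01/01/2016,01:00,40.1,V
01/01/2016,02:00,38.7,V
End of data
"""

toy_aq = pd.read_csv(io.StringIO(raw),
                     skiprows=2,          # skip the two note lines above the header
                     skipfooter=1,        # skip the footer line
                     engine='python',     # skipfooter is not supported by the default C engine
                     usecols=[0, 1, 2])   # keep only the first three columns
print(toy_aq)
```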

### Now we're ready to join!

The [syntax](http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging) for the `join` function is quite straightforward but can depend on the names of existing columns. For our current air quality and weather data frames we would `join` as follows: 

In [None]:
aqmetDF = aqDF.join(metDF, lsuffix='_l', rsuffix='_r')
aqmetDF.info()

The above creates a new dataframe (`aqmetDF`) from `aqDF` and `metDF`. We need to specify `lsuffix` and `rsuffix` as we have `Date` and `Time` columns in both of our dataframes - the suffixes are added to the original column names in the new dataframe so that we don't have duplicate column names.
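
You can see what the suffixes do, and what happens without them, on two small made-up data frames that share a `Date` column (all names and values here are invented for illustration):

```python
import pandas as pd

# Two made-up frames with the same index and an overlapping 'Date' column
a = pd.DataFrame({'Date': ['01/01', '02/01'], 'NO2': [40.0, 38.0]}, index=[0, 1])
b = pd.DataFrame({'Date': ['01/01', '02/01'], 'Temp': [5.0, 4.0]}, index=[0, 1])

# Without suffixes the overlapping 'Date' column makes join raise a ValueError
try:
    a.join(b)
except ValueError as err:
    print("join failed:", err)

# With suffixes the overlapping columns are renamed in the result
both = a.join(b, lsuffix='_l', rsuffix='_r')
print(list(both.columns))
```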

If we didn't have duplicate column names we wouldn't need suffixes, so let's keep only `float64` series for the weather data, dropping `WindGust` and `LocID` (as they are not as interesting as the other variables):  

In [None]:
metDF = metDF.select_dtypes(include=['float64'])
del metDF['WindGust'] 
del metDF['LocID']

Now that we have no duplicate column names in our dataframes (check using `.info()` if you like), our usage of `join` is much simpler:

In [None]:
aqmetDF = aqDF.join(metDF)
aqmetDF.info()
aqmetDF.head()

Let's save this new combined dataframe to disk as it could also be useful for your final report. Before we do so, let's check if there are any values we want to drop from or change in the data, using a pairplot to visualise:

## Summary

So we have now created two dataframes of combined data and saved these to disk:
- `LondonLSOAData.pkl`
- `HeathrowAQWeather2016.pkl`

You can load these data into a pandas dataframe easily using `read_pickle` and use them in your final reports.
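
If you want to convince yourself that nothing is lost on the way to disk and back, here is a minimal round-trip sketch. It uses a made-up data frame and a temporary file (`toy.pkl` is a hypothetical name, not one of the files listed above):

```python
import os
import tempfile
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({'GEO_CODE': ['E01', 'E02'], 'MedPrice': [250000.0, 400000.0]})

# Write to a temporary location so that no real files are overwritten
path = os.path.join(tempfile.mkdtemp(), 'toy.pkl')
toy.to_pickle(path)

# A pickle keeps the index and column types, so the frame should come back unchanged
loaded = pd.read_pickle(path)
print(loaded.equals(toy))
```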

## Exercises

You now have the 'full' data set that you can use for your final report. So start exploring!

If you want to join your own data for analysis in your final report, please discuss with James or Jon before doing so. 

Below are some exercises to help you think about the data you have created - feel free to work on all the exercises but you should be beginning to think about which data set you will focus on for your final report. 